Understanding Cluster Analysis Methods
Understanding Cluster Analysis Methods
Dr V V Haragopal
1
Cluster Analysis
What is it?
Arranging objects into groups is a natural and
necessary skill that we all share.
Try placing the following faces into groups. What criteria do you apply: size, hat,
glasses, gender, hair colour? Is there a different classification that makes as
much sense?
case sex glasses moustache smile hat
1 m y n y n
2 f n n y n
3 m y n n n
4 m n n n n
5 m n n y? n
6 m n y n y
7 m y n y n
8 m n n y n
9 m y y y n
10 f n n n n
11 m n y n n
12 f n n n n
Cluster Analysis: Basic Concepts and
Methods
7
Clustering for Data Understanding
and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Climate: understanding earth climate, find patterns of atmospheric
and ocean
Economic Science: market resarch
8
Clustering as a Preprocessing Tool
(Utility)
Summarization:
Preprocessing for regression, PCA, classification, and
association analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
9
Quality: What Is Good
Clustering?
A good clustering method will produce high quality
clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
Its ability to discover some or all of the hidden patterns
10
Measure the Quality of
Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
11
Considerations for Cluster
Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
Similarity measure
Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
12
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
these
Constraint-based clustering
User may give inputs on constraints
High dimensionality
13
Major Clustering Approaches
(I)
Partitioning approach:
Construct various partitions and then evaluate them by some
Hierarchical approach:
Grid-based approach:
based on a multiple-level granularity structure
14
Major Clustering Approaches
(II)
Model-based:
A model is hypothesized for each of the clusters and tries to find
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Link-based clustering:
Objects are often linked together in various ways
15
Cluster Analysis: Basic Concepts and
Methods
E ik1 pCi ( p ci ) 2
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion.
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
17
The K-Means Clustering Method
18
An Example of K-Means Clustering
K=2
Arbitrarily Update
partition the
objects cluster
into k centroids
groups
The initial data Loop if
set Reassign objects
needed
Partition objects into k nonempty
subsets
Repeat
Compute centroid (i.e., mean Update
the
point) for each partition cluster
Assign each object to the centroids
cluster of its nearest centroid
Until no change
19
Comments on the K-Means Method
21
What Is the Problem of the K-Means
Method?
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
22
PAM: A Typical K-Medoids Algorithm
Total Cost = 20
10 10 10
9 9 9
8 8 8
7 7 7
6
Arbitrar 6
Assign 6
5
y 5 each 5
4 choose 4 remaini 4
3
k object 3
ng 3
2
as 2
object 2
1 1
initial to
1
0 0 0
0 1 2 3 4 5 6 7 8 9 10
medoid 0 1 2 3 4 5 6 7 8 9 10
nearest 0 1 2 3 4 5 6 7 8 9 10
s medoid
K=2 s Randomly select a
Total Cost = 26 nonmedoid
object,Oramdom
10 10
Do loop 9
8
Compute
9
8
Swapping 7 total cost 7
Until no O and 6
of 6
Oramdom
change
5 5
4
swapping 4
If quality is 3
2
3
improved. 1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
23
The K-Medoid Clustering Method
25
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4
agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
26
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus,SPSS,etc.,
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
10 10 10
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
27
Dendrogram: Shows How Clusters are
Merged
Decompose data objects into a several levels of nested
partitioning (tree of clusters), called a dendrogram
28
DIANA (Divisive Analysis)
10 10
10
9 9
9
8 8
8
7 7
7
6 6
6
5 5
5
4 4
4
3 3
3
2 2
2
1 1
1
0 0
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10
29
Distance between X X
Clusters
Single link: smallest distance between an element in one cluster and
an element in the other, i.e., dist(K i, Kj) = min(tip, tjq)
Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(K i, Kj) = max(tip, tjq)
Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(K i, Kj) = avg(tip, tjq)
31
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n2),
where n is the number of total objects.
32
Person Food Clothing
X1 X2
A 2 4
AN EXAMPLE : E 8.5 1
Cluster analysis embraces a variety of techniques,
the main objective of which is to group
observations or variables into homogeneous and
distinct clusters. A simple numerical example will
help explain these objectives.
The daily expenditures on food (X1) and clothing
(X2) of five persons are shown in the adjacent
Table 1 (Illustrative data):
The Data is Fictitious and not at all realistic, but the example will help us explain the
essential features of cluster analysis as simply as possible. The data of Table 1 are
plotted in Figure 1. The numbers are fictitious and not at all realistic, but the example
will help us explain the essential features of cluster analysis as simply as possible.
Inspection of Figure 1 suggests that the Five observations form two clusters. The first
consists of persons a and d, and the second of b, c and e. It can be noted that the
observations in each cluster are similar to one another with respect to expenditures
on food and clothing, and that the two clusters are quite distinct from each other.
These conclusions concerning the number of clusters and their membership were
reached through a visual inspection of Figure 1. This inspection was possible
because only two variables were in grouping the observations. The question is : Can
a procedure for similar grouping of observations when there are more than two
variables or Attributes?
Figure 1
Measures of Distance For Variables:
Clustering methods require a more precise definition of “similarity" (“closeness",
“proximity") of observations and clusters.
When the grouping is based on variables, it is natural to employ the
familiar concept of distance. Consider below Figure as a map showing two
points, i and j, with coordinates (X1i;X2i) and (X1j ;X2j), respectively.
At each step, keep track of the distance at which the clusters are formed. In order to
determine the number of clusters, consider the step(s) at which the merging distance is
relatively large.
The three clusters remaining at this step and the distances between these
clusters are shown in Figure 3(a). We merge (be) with c to form the
cluster (bce) shown in Figure 3(b).
The nearest clusters are (a) and (d), which are now grouped into the
cluster (ad). The remaining steps are similarly executed.
We notice here that in this Problem 1 that the nearest and
furthest neighbour methods produce the same results in this illustration. In
other cases, however, the two methods may not agree.
Consider Figure 8 (a)in the next slide as an example. The nearest
neighbour method will probably not form the two groups perceived
by the naked eye. This is so because at some intermediate step
the method will probably merge the two “nose" points joined in
Figure (a) into the same cluster, and proceed to string along the
remaining points in chain-link fashion. The furthest neighbour
method, will probably identify the two clusters because it tends to
resist merging clusters the elements of which vary substantially in
distance from those of the other cluster. On the other hand, the
nearest neighbour
method will probably succeed in forming the two groups marked in
Figure(b), but the furthest neighbour method will probably not.
Figure 8 -Two cluster patterns
Figure: Cluster distance, average linkage method
compromise method is average linkage, under which the distance between two
clusters is the average of the distances of all pairs of observations, one
observation in the pair taken from the first cluster and the other from the
second cluster as shown in Figure below:
The three methods examined so far are examples of
“hierarchical agglomerative clustering methods”.
“Hierarchical" because all clusters formed by these
methods consist of mergers of previously formed
clusters. “Agglomerative"
because the methods begin with as many clusters as
there are observations and end with a single cluster
containing all observations.
A)Distance Measures For Attributes(To continue)
B)Grouping Variables.(To continue)
Distance Measure and Metric
1. Euclidean distance
Minkowski Metric
Minkowski Metric
Distance Measure and Metric
Mahalanobis distance
The Mahalanobis distance is a distance measure
introduced by P. C. Mahalanobis in 1936.
It is based on correlations between variables by which
different patterns can be identified and analysed.
is defined as:
Distance Measure and Metric
Mahalanobis distance
Mahalanobis distance
If the covariance matrix is the identity matrix then it is the same as Euclidean
Euclidean distance:
1 1 1 0
2 0 1 1
3 1 0 1
4 0 0 0
2
Grouping Variables Clustering:
Occasionally, clustering methods are applied to group variables rather than
observations. One situation where such a grouping is desirable is the design of
questionnaires. The First draft of a questionnaire often contains more questions than is
prudent to ensure a good response rate. When the draft questionnaire is tested on a
small number of respondents it may be observed that the responses to certain groups of
questions are highly correlated. Clustering analysis may be applied to identify groups of
questions that are similar to one another, in the sense that the answers to these
questions are correlated. Then, in the final form of the questionnaire only one of the
questions in each cluster of similar questions may be used as representative of all the
questions in the cluster.
For example, consider the following responses to three questions by five respondents to
the first draft of a questionnaire:
Respondent Q1 Q2 Q3
a 10 5 3
b 30 7.5 3.1
c 20 6 2.9
d 40 8 2.95
Contd…
The correlation coefficient, r, of Q1 and Q2 can be shown to be 0.984,that of Q1 and Q3
0.076, and that of Q2 and Q3 0.230. A measure of the “distance" (dissimilarity) between
two questions is 1 - r, and the starting table of distances between all pairs of questions
is: Any clustering method can now be applied to this table in the usual manner.
Correlations
Variable Q1 Q2 Q3
Q1 1 0.984 0.076
Q2 1 0.230
Q3 1
Variable Q1 Q2 Q3
Q1 0 0.016 0.924
Q2 0 0.770
Q3 0
To Sum Up……..
67