0% found this document useful (0 votes)
10 views67 pages

Understanding Cluster Analysis Methods

Uploaded by

shagun.ecomm07
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views67 pages

Understanding Cluster Analysis Methods

Uploaded by

shagun.ecomm07
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd

Cluster Analysis

Dr V V Haragopal

1
Cluster Analysis

The term cluster analysis (first used by Tryon,


1939) encompasses a number of different
algorithms and methods for grouping objects
of similar kind into respective categories

What is it?
Arranging objects into groups is a natural and
necessary skill that we all share.
Try placing the following faces into groups. What criteria do you apply: size, hat,
glasses, gender, hair colour? Is there a different classification that makes as
much sense?
case sex glasses moustache smile hat

1 m y n y n

2 f n n y n

3 m y n n n

4 m n n n n

5 m n n y? n

6 m n y n y

7 m y n y n

8 m n n y n

9 m y y y n

10 f n n n n

11 m n y n n

12 f n n n n
Cluster Analysis: Basic Concepts and
Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
6
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis (or clustering, data segmentation, …)


 Finding similarities between data according to the

characteristics found in the data and grouping similar


data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution

 As a preprocessing step for other algorithms

7
Clustering for Data Understanding
and Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
 Climate: understanding earth climate, find patterns of atmospheric
and ocean
 Economic Science: market resarch
8
Clustering as a Preprocessing Tool
(Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any
cluster

9
Quality: What Is Good
Clustering?
 A good clustering method will produce high quality
clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 Its ability to discover some or all of the hidden patterns

10
Measure the Quality of
Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”

The answer is typically highly subjective
11
Considerations for Cluster
Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
 Similarity measure
 Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
12
Requirements and Challenges
 Scalability
 Clustering all the data instead of only on samples

 Ability to deal with different types of attributes


 Numerical, binary, categorical, ordinal, linked, and mixture of

these
 Constraint-based clustering
 User may give inputs on constraints

 Use domain knowledge to determine input parameters

 Interpretability and usability


 Others
 Discovery of clusters with arbitrary shape

 Ability to deal with noisy data

 Incremental clustering and insensitivity to input order

 High dimensionality

13
Major Clustering Approaches
(I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some

criterion, e.g., minimizing the sum of square errors


 Typical methods: k-means, k-medoids,

 Hierarchical approach:

 Create a hierarchical decomposition of the set of data (or objects)

using some criterion


 Density-based approach:
 Based on connectivity and density functions

 Grid-based approach:
 based on a multiple-level granularity structure

14
Major Clustering Approaches
(II)
 Model-based:
 A model is hypothesized for each of the clusters and tries to find

the best fit of that model to each other


 Frequent pattern-based:
 Based on the analysis of frequent patterns

 Typical methods: p-Cluster

 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific

constraints
 Link-based clustering:
 Objects are often linked together in various ways

15
Cluster Analysis: Basic Concepts and
Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
16
Partitioning Algorithms: Basic
Concept
 Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)

E  ik1 pCi ( p  ci ) 2
 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion.
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
17
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four


steps:
 Partition objects into k non-empty subsets
 Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when the assignment does
not change

18
An Example of K-Means Clustering

K=2

Arbitrarily Update
partition the
objects cluster
into k centroids
groups
The initial data Loop if
set Reassign objects
needed
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean Update
the
point) for each partition cluster
 Assign each object to the centroids
cluster of its nearest centroid
 Until no change
19
Comments on the K-Means Method

 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is


# iterations. Normally, k, t << n.

Comparing: PAM: O(k(n-k)2 ),

Comment: Often terminates at a local optimal.
 Weakness
 Applicable only to objects in a continuous n-dimensional space

Using the k-modes method for categorical data

In comparison, k-medoids can be applied to a wide range of
data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
20
Variations of the K-Means Method

 Most of the variants of the k-means which differ in


 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method

21
What Is the Problem of the K-Means
Method?

 The k-means algorithm is sensitive to outliers !


 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the object in a cluster
as a reference point, medoids can be used, which is the most
centrally located object in a cluster

10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

22
PAM: A Typical K-Medoids Algorithm
Total Cost = 20
10 10 10

9 9 9

8 8 8

7 7 7

6
Arbitrar 6
Assign 6

5
y 5 each 5

4 choose 4 remaini 4

3
k object 3
ng 3

2
as 2
object 2

1 1
initial to
1

0 0 0
0 1 2 3 4 5 6 7 8 9 10
medoid 0 1 2 3 4 5 6 7 8 9 10
nearest 0 1 2 3 4 5 6 7 8 9 10

s medoid
K=2 s Randomly select a
Total Cost = 26 nonmedoid
object,Oramdom
10 10

Do loop 9

8
Compute
9

8
Swapping 7 total cost 7

Until no O and 6
of 6

Oramdom
change
5 5

4
swapping 4

If quality is 3

2
3

improved. 1 1

0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

23
The K-Medoid Clustering Method

 K-Medoids Clustering: Find representative objects (medoids) in clusters


 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)

Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering

PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
 Efficiency improvement on PAM
 PAM on samples
 Randomized re-sampling
24
Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

25
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4
agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
26
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus,SPSS,etc.,
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
10 10 10

9 9 9

8 8 8

7 7 7

6 6 6

5 5 5

4 4 4

3 3 3

2 2 2

1 1 1

0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

27
Dendrogram: Shows How Clusters are
Merged
Decompose data objects into a several levels of nested
partitioning (tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the


dendrogram at the desired level, then each connected
component forms a cluster

28
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g.,
Splus,SPSS etc.,
 Inverse order
 Eventually each node forms a cluster on its own

10 10
10

9 9
9
8 8
8

7 7
7
6 6
6

5 5
5
4 4
4

3 3
3
2 2
2

1 1
1
0 0
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10

29
Distance between X X

Clusters
 Single link: smallest distance between an element in one cluster and
an element in the other, i.e., dist(K i, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(K i, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(K i, Kj) = avg(tip, tjq)

 Centroid: distance between the centroids of two clusters, i.e., dist(K i,


Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = dist(Mi, Mj)

Medoid: a chosen, centrally located object in the cluster
30
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster iN1(t )
Cm  N ip

 Radius: square root of average distance from any point


of the cluster to its centroid  N (t  cm ) 2
Rm  i 1 ip
N
 Diameter: square root of average mean squared
distance between all pairs of points in the cluster
 N  N (t  t ) 2
Dm  i 1 i 1 ip iq
N ( N  1)

31
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n2),
where n is the number of total objects.

32
Person Food Clothing
X1 X2

A 2 4

Concept Of Cluster Analysis: B 8 2


Cluster analysis is often used in conjunction
with other analyses (such as discriminant
analysis). The User must be able to C 9 3
interpret the cluster analysis based on their
understanding of the data to determine if
the results produced by the analysis are D 1 5
actually meaningful.

AN EXAMPLE : E 8.5 1
Cluster analysis embraces a variety of techniques,
the main objective of which is to group
observations or variables into homogeneous and
distinct clusters. A simple numerical example will
help explain these objectives.
The daily expenditures on food (X1) and clothing
(X2) of five persons are shown in the adjacent
Table 1 (Illustrative data):
The Data is Fictitious and not at all realistic, but the example will help us explain the
essential features of cluster analysis as simply as possible. The data of Table 1 are
plotted in Figure 1. The numbers are fictitious and not at all realistic, but the example
will help us explain the essential features of cluster analysis as simply as possible.
Inspection of Figure 1 suggests that the Five observations form two clusters. The first
consists of persons a and d, and the second of b, c and e. It can be noted that the
observations in each cluster are similar to one another with respect to expenditures
on food and clothing, and that the two clusters are quite distinct from each other.

These conclusions concerning the number of clusters and their membership were
reached through a visual inspection of Figure 1. This inspection was possible
because only two variables were in grouping the observations. The question is : Can
a procedure for similar grouping of observations when there are more than two
variables or Attributes?
Figure 1
Measures of Distance For Variables:
Clustering methods require a more precise definition of “similarity" (“closeness",
“proximity") of observations and clusters.
When the grouping is based on variables, it is natural to employ the
familiar concept of distance. Consider below Figure as a map showing two
points, i and j, with coordinates (X1i;X2i) and (X1j ;X2j), respectively.

Fig2:Distance measures illustrated


The Euclidean distance between the two points is the hypotenuse of the
triangle ABC:
D(I, j) =Sqrt( A^2 + B^2 ) = sqrt((X1i - X1j)^2 +(X2i - X2j )^2
An observation i is declared to be closer (more similar) to j than to observation
k if D(I, j) < D(I,k).
An alternative measure is the squared Euclidean distance. In Figure 2 given
above the squared distance between the two points i and j is
D2(I, j) = A^2 + B^2 = (X1i - X1j)^2 + (X2i - X2j)^2.
Yet another measure is the city block distance, defined as
D3(I, j) = |A|+ |B| = |X1i - X1j |+|X2i - X2j |.
As the name suggests, it is the distance one would travel if the points i and j
were located at opposite corners of a city block.
The distance measures can be extended to more than two variables. For
example, the Euclidean distance between an observation (X1i,X2i, …….Xki)
and another (X1j ,X2j ……..Xkj) is :
Contd…..

All three measures of distance depend on the units in


which X1 and X2 are measured, and are influenced by
whichever variable takes numerically larger values. For
this reason, the variables are often standardized so that
they have mean 0 and variance 1 before cluster analysis
is applied. Alternatively, weights (w1, w2, …..,
wk )rejecting the importance of the variables
could be used and a weighted measure of distance
calculated.
CLUSTERING METHODS:

Given a distance measure, a reasonable procedure for grouping ‘n’ observations


proceeds in the following steps.
Begin with as many clusters as there are observations, that is, with each
observation forming a separate cluster. Merge that pair of observations that are
nearest to one another, leaving n - 1 clusters for the next step. Next, merge into
one cluster that pair of clusters that are nearest to one another, Leaving( n -
2 )clusters for the next step. Continue in this fashion, reducing the number of
clusters by one at each step, until a single cluster is formed consisting of all n
observations.

At each step, keep track of the distance at which the clusters are formed. In order to
determine the number of clusters, consider the step(s) at which the merging distance is
relatively large.

A problem with this procedure is how to measure the distance between


clusters consisting of two or more observations. Perhaps the simplest method
is to treat the distance between the two nearest observations, one from each
cluster, as the distance between the two clusters. This is known as the
nearest neighbour (or single linkage) method. Figure 3 illustrates.
Fig3-Cluster distance, nearest neighbour method.

Let us suppose that Euclidean distance is the appropriate measure of


Proximity. We begin with each of the five observations forming its own
cluster. The distance between each pair of observations is shown in the next
slide.
For example, the distance between a and b is Sqrt [(2- 8)2 + (4- 2)2] = Sqrt(36
+4 )
=6.325

Observations b and e are nearest (most similar) and, as shown in Figure


2(b), are grouped in the same cluster.
Assuming the nearest neighbour method is used, the distance between
the cluster (b e) and another observation is the smaller of the distances
between
that observation, on the one hand, and b and e, on the other. For example
D(be;a) = min ( D(b,a),D(e, a)) = min(6.325, 7.159)= 6.325.
The four clusters remaining at the end of this step and the distances
between these clusters are shown in Figure (a).
Two pairs of clusters are closest to one another at distance 1.414; these
are (ad) and (bce). We arbitrarily select (a,d) as the new cluster, as shown
in Figure 2(b).
The distance between (be) and (ad) is:

D(be; ad) = min( D(be,a),D(be, d)) = min(6:325,7:616) = 6:325;

while that between c and (ad) is


D(c,ad) = min(D(c,a),D(c, d)) = min(7.071,8:246) = [Link]

The three clusters remaining at this step and the distances between these
clusters are shown in Figure 3(a). We merge (be) with c to form the
cluster (bce) shown in Figure 3(b).

The distance between the two remaining clusters is:

D(ad,bce) = min(D(ad,be),D(ad, c)) = min(6.325,7.071) = 6.325:

The grouping of these two clusters, it will be noted, occurs at a distance


of 6.325, a much greater distance than that at which the earlier groupings
took place. Figure 4 shows the final grouping.

See the next for slide For figure 3


Figure 4 - Nearest neighbor
method,
The groupings and the distance at which these took place are also
shown in the tree diagram (dendrogram) of Figure 5.
One usually searches the dendrogram for large jumps in the
grouping distance as guidance in arriving at the number of groups.
In this illustration, it is clear that the elements in each of the
clusters (ad) and (bce) are close (they were merged at a small
distance), but the clusters are distant (the distance at which they
merge is large).
The nearest neighbour is not the only method for measuring the
distance between clusters. Under the furthest neighbour (or
complete linkage) method, the distance between two clusters is
the distance between their two most distant members. Figure 6
illustrates.
Figure 5 (Nearest neighbour method, dendrogram) and 6(Cluster
distance, furthest neighbour method)
Example Contd…..

Example 1 (Continued) The distances between all pairs of


observations shown in Figure 4 are the same as with the nearest
neighbour method. Therefore, the furthest neighbour method also
calls for grouping b and e at Step 1. However, the distances
between (be), on the one hand, and the clusters (a), (c), and (d),
on the other, are different:
D(be,a) = max(D(b,a),D(e, a)) = max(6.325, 7.159) = 7:159

D(be, c) = max(D(b, c),D(e, c)) = max(1.414, 2:062) = 2:062

D(be, d) = max(D(b,d),D(e,d)) = max(7.616,8.500) = 8.500

The four clusters remaining at Step 2 and the distances between


these clusters are shown in Figure 7(a).
Figure 7(Furthest neighbour method, Step 2)

The nearest clusters are (a) and (d), which are now grouped into the
cluster (ad). The remaining steps are similarly executed.
We notice here that in this Problem 1 that the nearest and
furthest neighbour methods produce the same results in this illustration. In
other cases, however, the two methods may not agree.
Consider Figure 8 (a)in the next slide as an example. The nearest
neighbour method will probably not form the two groups perceived
by the naked eye. This is so because at some intermediate step
the method will probably merge the two “nose" points joined in
Figure (a) into the same cluster, and proceed to string along the
remaining points in chain-link fashion. The furthest neighbour
method, will probably identify the two clusters because it tends to
resist merging clusters the elements of which vary substantially in
distance from those of the other cluster. On the other hand, the
nearest neighbour
method will probably succeed in forming the two groups marked in
Figure(b), but the furthest neighbour method will probably not.
Figure 8 -Two cluster patterns
Figure: Cluster distance, average linkage method

compromise method is average linkage, under which the distance between two
clusters is the average of the distances of all pairs of observations, one
observation in the pair taken from the first cluster and the other from the
second cluster as shown in Figure below:
The three methods examined so far are examples of
“hierarchical agglomerative clustering methods”.
“Hierarchical" because all clusters formed by these
methods consist of mergers of previously formed
clusters. “Agglomerative"
because the methods begin with as many clusters as
there are observations and end with a single cluster
containing all observations.
A)Distance Measures For Attributes(To continue)
B)Grouping Variables.(To continue)
Distance Measure and Metric
1. Euclidean distance
Minkowski Metric
Minkowski Metric
Distance Measure and Metric

Mahalanobis distance
The Mahalanobis distance is a distance measure
introduced by P. C. Mahalanobis in 1936.
It is based on correlations between variables by which
different patterns can be identified and analysed.

It is a useful way of determining similarity of an unknown


sample set to a known one.
It differs from Euclidean distance in that it takes into
account the correlations of the data set.
Distance Measure and Metric
Mahalanobis distance

The Mahalanobis distance from a group of values with mean

and covariance matrix Σ for a multivariate vector

is defined as:
Distance Measure and Metric

Mahalanobis distance

Mahalanobis distance can also be defined as


dissimilarity measure between two random
vectors x and y of the same distribution with
the covariance matrix Σ :
Distance Measure and Metric

Mahalanobis distance

If the covariance matrix is the identity matrix then it is the same as Euclidean

distance. If covariance matrix is diagonal, then it is called Normalized

Euclidean distance:

where σi is the standard deviation of the xi over the sample set.


DISTANCE MEASURES FOR ATTRIBUTES
The distance measures presented in and used in earlier examples must be modified if the
clustering of observations is based on attributes.
Consider, for example, the following description of four persons according to marital status (single,
married, other) and gender (male,female):
A reasonable measure of the similarity of two observations is the ratio of the number of matches
(identical categories) to the number of attributes.
For example, since a and d are both single and female, the similarity measure is 2/2 or 1; b and c
do not have the same marital status but are both male, so the similarity measure is 1/2. To be
consistent with earlier measures,however, we use instead:
Obs. Marital status Gender
a Single Female
b Married Male
c Other Male
d Single Female
Da(i, j) = 1 – {(Number of matches)/(Number of attributes)}.
as the measure of “distance" (dissimilarity) of two observations i and j.
We declare two observations to be closer, the smaller this distance. The
distances between all pairs of observations in our example are as follows:
Obs. a b c d
_____________________
a 0 1 1 0
b 1 0 0.5 1
c 1 0.5 0 1
d 0 1 1 0
Any of the clustering methods described earlier can be applied to the above distances. For example, in
the first step of the nearest neighbor, furthest neighbor, or complete linkage methods a and d would be
grouped to form the first cluster. The remaining steps would be carried out in the usual fashion.
When the grouping is to be based on variables and attributes, thus, the simplest approach is to convert
the variables to attributes and then apply the measure Da(i , j) to the distance between any pair of
observations. For example, suppose that the four observations will be grouped according to marital
status, gender, and age:
Obs. Marital status Gender Age (yrs.) category
a Single Female 15 Y
b Married Male 30 M
c Other Male 60 O
d Single Female 32 M
We could make age an attribute with, say, three categories: Y (under 25 years old), M
(25 to 50), and O (more than 50 years old). The “distance“ between b and c, for
example, is:
Da(b, c) = 1 - (1/3)=2/3.
The distances between all pairs of observations are as follows:
Obs. a b c d
_____________________
a 0 1 1 1/3
b * 0 2/3 2/3
c * * 0 1
d * * * 0
Any clustering method can now be applied to this table of distances.
Example -1. A study was made to identify clusters of warehouse items that tended to be
ordered together. Items in the same cluster could be stored near one another in the
warehouse, so as to minimize the effort needed to select those required for particular
orders. The study involved a distributor of telecommunications products who stored
approximately 1,000 items and was filling approximately 75,000 orders per month on
average.
Contd…
Available was a history of K orders and the items that each order required. To measure
the “distance" between two items, a variable Vi for each item i was introduced such that
Vik = 1 if item i was required by a given order k, otherwise Vik = 0. The distance between
any pair of items i and j was defined as :
K
D(i, j) = ∑ | Vik - Vjk |
k=1
The following table illustrates the calculation of the distance for two items and a fictitious
history of four orders:
It is clear that smaller values of the distance measure, D(1, 2) = 2 in this illustration,
indicate that the two items are frequently ordered together.

Order No. Item 1 Item 2 |V1k – V2k |


k V1k V2k

1 1 1 0
2 0 1 1
3 1 0 1
4 0 0 0
2
Grouping Variables Clustering:
Occasionally, clustering methods are applied to group variables rather than
observations. One situation where such a grouping is desirable is the design of
questionnaires. The First draft of a questionnaire often contains more questions than is
prudent to ensure a good response rate. When the draft questionnaire is tested on a
small number of respondents it may be observed that the responses to certain groups of
questions are highly correlated. Clustering analysis may be applied to identify groups of
questions that are similar to one another, in the sense that the answers to these
questions are correlated. Then, in the final form of the questionnaire only one of the
questions in each cluster of similar questions may be used as representative of all the
questions in the cluster.
For example, consider the following responses to three questions by five respondents to
the first draft of a questionnaire:
Respondent Q1 Q2 Q3
a 10 5 3
b 30 7.5 3.1
c 20 6 2.9
d 40 8 2.95
Contd…
The correlation coefficient, r, of Q1 and Q2 can be shown to be 0.984,that of Q1 and Q3
0.076, and that of Q2 and Q3 0.230. A measure of the “distance" (dissimilarity) between
two questions is 1 - r, and the starting table of distances between all pairs of questions
is: Any clustering method can now be applied to this table in the usual manner.

Correlations
Variable Q1 Q2 Q3
Q1 1 0.984 0.076
Q2 1 0.230
Q3 1

Dissimilarity (1- r) Values

Variable Q1 Q2 Q3
Q1 0 0.016 0.924

Q2 0 0.770

Q3 0
To Sum Up……..

Cluster analysis embraces a variety of methods, the main objective of


which is to group observations or variables into homogeneous and distinct
clusters.
For groupings based on variables, frequently used measures of the similarity of
observations are the Euclidean, squared, or city block distance, applied to the original,
standardized, or weighted variables. For groupings based on attributes, a measure of
the similarity of two observations is the ratio of the number of matches (identical
categories) to the number of [Link] measures are possible.
The nearest neighbor (single linkage), furthest neighbor (complete linkage) and average
linkage methods are examples of hierarchical agglomerative clustering methods. These
methods begin with as many clusters as there are observations and end with a single
cluster containing all observations; all clusters formed by these methods are mergers of
previously formed clusters. Other types of clustering methods are the hierarchical
divisive (beginning with a single cluster and ending with as many clusters as there are
observations) and the non-hierarchical methods (a notable example of which is the
k-means method often employed for \quick clustering" by some statistical programs).
Clustering methods can also be employed to group variables rather than observations,
as in the case of questionnaire design. These groupings are frequently based on the
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Quality of clustering results can be evaluated in various ways

67

You might also like