CHAPTER 8
Clustering
Clustering
K-means
Clustering
3
Clustering
• Clustering is the process of grouping a set of data objects
into multiple groups or clusters so that objects within a cluster
have high similarity, but are very dissimilar to objects in other
clusters.
• Dissimilarities and similarities are assessed based on the
attribute values describing the objects and often involve
distance measures.
• Clustering as a data mining tool has its roots in many
application areas such as biology, security, business
intelligence, and Web search.
4
Clustering
In many fields there are obvious benefits to be had from
grouping together similar objects. For example
In an economics application, we might be interested in finding
countries whose economies are similar.
In a marketing application, we might wish to find clusters of
customers with similar buying behavior.
In a medical application, we might wish to find clusters of
patients with similar symptoms.
In a document retrieval application, we might wish to find
clusters of documents with related content or topics.
In business intelligence, we might wish to organize a large
number of customers into groups, where customers within a
group share strong similar characteristics.
5
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups
Inter-cluster
Intra-cluster
distances are
distances are
maximized
minimized
6
Clustering
Object can be described by the values of just two attributes.
So, we can represent them as points in a two-dimensional
space (a plane).
7
Clustering
It is usually easy to visualize clusters in two
dimensions.
The points seem to fall naturally into four groups
as shown by the curves drawn surrounding sets of
points.
8
Clustering
• However, there is frequently more than one
possibility.
9
Clustering
• What is a natural grouping among these objects?
10
Clustering
11
Basic Clustering Methods
The major fundamental clustering methods can be
classified into the following categories:
• Hierarchical algorithms:
• A set of nested clusters organized as a hierarchical tree
• Partitional algorithms:
• A division data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
12
Basic Clustering Methods
13
Basic Clustering Methods
Method General Characteristics
• Clustering is a hierarchical decomposition (i.e., multiple
Hierarchical levels).
methods • Cannot correct erroneous merges or splits.
• Permit cluster to have sub clusters.
Partitioning • Find mutually exclusive clusters of round shape.
methods • Distance-based.
• May use mean or medoid (etc.) to represent cluster center.
• Effective for small- to medium-size data sets.
• Division of the set of data object into non-overlapping
subsets.
14
Basic Clustering Methods
15
Partitioning methods
2- Partitioning methods: w̅
toGroup
object
• Given a set of n objects, a partitioning method constructs
k partitions of the data, where each partition represents a
cluster and n ≥ k. That is, it divides the data into k groups
ÉF objects group
such that each group must contain at least one object.
• In other words, partitioning methods conduct one-level
partitioning on data sets.
• The basic partitioning methods typically adopt exclusive
cluster separation. That is, each object must belong to
exactly one group.
16
Partitioning methods Interclasturing May distance
is I intra in min distance
❑ Most partitioning methods are distance-based.
❑ Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning.
❑ It then uses an iterative relocation technique that attempts to improve
the partitioning by moving objects from one group to another.
❑ The general criterion of a good partitioning is that objects in the same
cluster are “close” or related to each other, whereas objects in different
clusters are “far apart” or very different. There are various kinds of
other.
K-Means
18
K-means Clustering
There are many methods of clustering. The most commonly
used algorithms are: k-means clustering and hierarchical
clustering.
k-means clustering is an exclusive clustering algorithm. Each object
is assigned to precisely one of a set of clusters.
For this method of clustering, we start by deciding how many
clusters we would like to form from the data.
We call this value k.
The value of k is generally a small integer, such as 2, 3, 4 or 5, but
may be larger.
19
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid I
• Number of clusters, K, must be specified
• The basic algorithm is very simple
Steps
Random
20
Algorithm k-means
The k-means algorithm is an algorithm to cluster n
objects based on attributes into k partitions, where k
< n.
Steps
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise go to 3.
21
K-means Clustering – Details
g.s 1
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc. [Link]
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
• I = number of iterations, d = number of attributes
22
Algorithm k-means
Iii
23
Importance of Choosing Initial Centroids
Iteration 1 Iteration 2 Iteration 3
3 3 3
2.5 2.5 2.5
2 2 2
1.5 1.5 1.5
y
y
1 1 1
0.5 0.5 0.5
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
i
Iteration 4 Iteration 5 Iteration 6
3 3 3
2.5 2.5 2.5
2 2 2
1.5 1.5 1.5
y
y
1 1 1
0.5 0.5 0.5
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
24
Distance Measure between Objects
It is first necessary to decide on a way of measuring the
distance between two points.
As for nearest neighbour classification, a measure commonly
used when clustering is the Euclidean distance.
We will assume that all attribute values are continuous.
of
25
Centroid
We need to introduce the notion of the ‘centre’ of a
cluster, generally called its centroid.
Assuming that we are using Euclidean distance or
something similar as a measure, we define the
centroid of a cluster to be the point for which each
attribute value is the average of the values of the
corresponding attribute for all the points in the cluster.
Hobject
centroid mean
26
K-means Clustering Algorithm: Example
We will use the k-means
algorithm to cluster 16 objects.
I I
Each object with two attributes
x and y. 3vahe
ofdiston
2Ddimention
27
K-means Clustering Algorithm: Example
28
K-means Clustering Algorithm: Example
Step 1: Set initial k and initial centroids
We will assume that we have chosen k = 3 and that
three points have been selected to be the locations
of the initial three centroids.
This initial (arbitrary) centroids are shown below.
29
K-means Clustering Algorithm: Example
Step 2: Calculate the Euclidean distance
Calculate the Euclidean distance of each of the 16
points from the three centroids.
For example, the distance of the first point (6.8, 12.6)
from the first centroid (3.8, 9.9) is simply
Step 3: Assign objects to clusters
The column ‘cluster’ indicates the centroid closest to
each point and thus the cluster to which it should be
assigned.
30
K-means Clustering Algorithm: Example
Fitation
meEating
Average of the clustrius
9am
31
K-means Clustering Algorithm: Example
0.8 1.2 2.893.8 4.4 4.8 6.6 8.2 9
y
C [Link].916s 7ft astsa
32
K-means Clustering Algorithm: Example
Step 4: Recalculate the centroids of each cluster
• We next calculate the centroids of the three clusters using the x and
y values of the objects currently assigned to each one. For example:
6.0+6.2+7.6
• Centroid 3= 𝑥 = = 6.6
3
19.9+18.5+17.4
• Centroid 3= 𝑦 = = 18.5
3
the
Just
Same
33
K-means Clustering Algorithm: Example
• The three centroids have all been moved by
the assignment process, but the movement of
the third one is less than for the other two.
34
K-means Clustering Algorithm: Example
d1 d2 d3 Cluster
Repeat Step 2 & Step 3: Calculate
the Euclidean distance from new
centroids, then assign objects to
clusters
This gives the revised set of
clusters shown in Figure 14.11.
The centroids from now on the
centroids are ‘imaginary points’
corresponding to the ‘centre’ of
each cluster, not actual points
within the clusters.
In fact only one point has moved.
The object at (8.3, 6.9) has moved
from cluster 2 to cluster 1.
35
K-means Clustering Algorithm: Example
Repeat Step 4: recalculate the positions of the three
centroids.
Stop
6 dig
I
36
K-means Clustering Algorithm: Example
The first two centroids have moved a little, but the
third has not moved at all.
We assign the 16 objects to clusters once again, as
below.
37
K-means Clustering Algorithm: Example
These are the same clusters as before.
Their centroids will be the same as
those from which the clusters were
generated.
Hence the termination condition of the k-
means algorithm ‘repeat ... until the
centroids no longer move’ has been
met and these are the final clusters wie [Link] 11 I
produced by the algorithm.
38
Limitations of K-means
• K-means has problems when clusters are of
differing
supporise
• Sizes Classfication
Chestring unsuppernised
• Densities
• Non-round shapes
• K-means has problems when the data contains
outliers. I
1 Éo
datacleaningveryimportanttoavoidhis
problems
39
End of the Chapter
Thanks