0% found this document useful (0 votes)
10 views39 pages

K-means Clustering Explained

Clustering is a data mining technique that groups similar data objects into clusters based on their attributes, with applications in various fields such as economics, marketing, and medicine. The document discusses basic clustering methods, particularly focusing on k-means clustering, which involves partitioning data into k clusters based on proximity to centroids. Limitations of k-means include challenges with clusters of varying sizes, densities, shapes, and the presence of outliers.

Uploaded by

lammoor.26
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views39 pages

K-means Clustering Explained

Clustering is a data mining technique that groups similar data objects into clusters based on their attributes, with applications in various fields such as economics, marketing, and medicine. The document discusses basic clustering methods, particularly focusing on k-means clustering, which involves partitioning data into k clusters based on proximity to centroids. Limitations of k-means include challenges with clusters of varying sizes, densities, shapes, and the presence of outliers.

Uploaded by

lammoor.26
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

CHAPTER 8

Clustering
Clustering
K-means
Clustering
3

Clustering
• Clustering is the process of grouping a set of data objects
into multiple groups or clusters so that objects within a cluster
have high similarity, but are very dissimilar to objects in other
clusters.

• Dissimilarities and similarities are assessed based on the


attribute values describing the objects and often involve
distance measures.

• Clustering as a data mining tool has its roots in many


application areas such as biology, security, business
intelligence, and Web search.
4

Clustering
In many fields there are obvious benefits to be had from
grouping together similar objects. For example
In an economics application, we might be interested in finding
countries whose economies are similar.
In a marketing application, we might wish to find clusters of
customers with similar buying behavior.
In a medical application, we might wish to find clusters of
patients with similar symptoms.
In a document retrieval application, we might wish to find
clusters of documents with related content or topics.
In business intelligence, we might wish to organize a large
number of customers into groups, where customers within a
group share strong similar characteristics.
5

What is Cluster Analysis?


• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups

Inter-cluster
Intra-cluster
distances are
distances are
maximized
minimized
6

Clustering
Object can be described by the values of just two attributes.
So, we can represent them as points in a two-dimensional
space (a plane).
7

Clustering
It is usually easy to visualize clusters in two
dimensions.
The points seem to fall naturally into four groups
as shown by the curves drawn surrounding sets of
points.
8

Clustering
• However, there is frequently more than one
possibility.
9

Clustering
• What is a natural grouping among these objects?
10

Clustering
11

Basic Clustering Methods


The major fundamental clustering methods can be
classified into the following categories:

• Hierarchical algorithms:
• A set of nested clusters organized as a hierarchical tree

• Partitional algorithms:
• A division data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
12

Basic Clustering Methods


13

Basic Clustering Methods


Method General Characteristics

• Clustering is a hierarchical decomposition (i.e., multiple

Hierarchical levels).
methods • Cannot correct erroneous merges or splits.
• Permit cluster to have sub clusters.
Partitioning • Find mutually exclusive clusters of round shape.
methods • Distance-based.
• May use mean or medoid (etc.) to represent cluster center.
• Effective for small- to medium-size data sets.
• Division of the set of data object into non-overlapping
subsets.
14

Basic Clustering Methods


15

Partitioning methods
2- Partitioning methods: w̅
toGroup
object

• Given a set of n objects, a partitioning method constructs


k partitions of the data, where each partition represents a
cluster and n ≥ k. That is, it divides the data into k groups
ÉF objects group
such that each group must contain at least one object.

• In other words, partitioning methods conduct one-level


partitioning on data sets.

• The basic partitioning methods typically adopt exclusive


cluster separation. That is, each object must belong to
exactly one group.
16

Partitioning methods Interclasturing May distance


is I intra in min distance
❑ Most partitioning methods are distance-based.

❑ Given k, the number of partitions to construct, a partitioning method

creates an initial partitioning.

❑ It then uses an iterative relocation technique that attempts to improve

the partitioning by moving objects from one group to another.

❑ The general criterion of a good partitioning is that objects in the same

cluster are “close” or related to each other, whereas objects in different


clusters are “far apart” or very different. There are various kinds of
other.
K-Means
18

K-means Clustering
There are many methods of clustering. The most commonly
used algorithms are: k-means clustering and hierarchical
clustering.

k-means clustering is an exclusive clustering algorithm. Each object


is assigned to precisely one of a set of clusters.

For this method of clustering, we start by deciding how many


clusters we would like to form from the data.
We call this value k.

The value of k is generally a small integer, such as 2, 3, 4 or 5, but


may be larger.
19

K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid I
• Number of clusters, K, must be specified
• The basic algorithm is very simple

Steps
Random
20

Algorithm k-means
The k-means algorithm is an algorithm to cluster n
objects based on attributes into k partitions, where k
< n.
Steps
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise go to 3.
21

K-means Clustering – Details


g.s 1
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc. [Link]

• K-means will converge for common similarity measures


mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
• I = number of iterations, d = number of attributes
22

Algorithm k-means

Iii
23

Importance of Choosing Initial Centroids


Iteration 1 Iteration 2 Iteration 3
3 3 3

2.5 2.5 2.5

2 2 2

1.5 1.5 1.5


y

y
1 1 1

0.5 0.5 0.5

0 0 0

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x

i
Iteration 4 Iteration 5 Iteration 6
3 3 3

2.5 2.5 2.5

2 2 2

1.5 1.5 1.5


y

y
1 1 1

0.5 0.5 0.5

0 0 0

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
24

Distance Measure between Objects


It is first necessary to decide on a way of measuring the
distance between two points.
As for nearest neighbour classification, a measure commonly
used when clustering is the Euclidean distance.
We will assume that all attribute values are continuous.

of
25

Centroid
We need to introduce the notion of the ‘centre’ of a
cluster, generally called its centroid.
Assuming that we are using Euclidean distance or
something similar as a measure, we define the
centroid of a cluster to be the point for which each
attribute value is the average of the values of the
corresponding attribute for all the points in the cluster.

Hobject

centroid mean
26

K-means Clustering Algorithm: Example

We will use the k-means


algorithm to cluster 16 objects.
I I
Each object with two attributes
x and y. 3vahe
ofdiston
2Ddimention
27

K-means Clustering Algorithm: Example


28

K-means Clustering Algorithm: Example


Step 1: Set initial k and initial centroids
We will assume that we have chosen k = 3 and that
three points have been selected to be the locations
of the initial three centroids.
This initial (arbitrary) centroids are shown below.
29

K-means Clustering Algorithm: Example


Step 2: Calculate the Euclidean distance
Calculate the Euclidean distance of each of the 16
points from the three centroids.
For example, the distance of the first point (6.8, 12.6)
from the first centroid (3.8, 9.9) is simply

Step 3: Assign objects to clusters


The column ‘cluster’ indicates the centroid closest to
each point and thus the cluster to which it should be
assigned.
30

K-means Clustering Algorithm: Example

Fitation

meEating
Average of the clustrius
9am
31

K-means Clustering Algorithm: Example


0.8 1.2 2.893.8 4.4 4.8 6.6 8.2 9
y
C [Link].916s 7ft astsa
32

K-means Clustering Algorithm: Example


Step 4: Recalculate the centroids of each cluster

• We next calculate the centroids of the three clusters using the x and
y values of the objects currently assigned to each one. For example:
6.0+6.2+7.6
• Centroid 3= 𝑥 = = 6.6
3
19.9+18.5+17.4
• Centroid 3= 𝑦 = = 18.5
3

the
Just
Same
33

K-means Clustering Algorithm: Example


• The three centroids have all been moved by
the assignment process, but the movement of
the third one is less than for the other two.
34

K-means Clustering Algorithm: Example


d1 d2 d3 Cluster
Repeat Step 2 & Step 3: Calculate
the Euclidean distance from new
centroids, then assign objects to
clusters
This gives the revised set of
clusters shown in Figure 14.11.
The centroids from now on the
centroids are ‘imaginary points’
corresponding to the ‘centre’ of
each cluster, not actual points
within the clusters.
In fact only one point has moved.
The object at (8.3, 6.9) has moved
from cluster 2 to cluster 1.
35

K-means Clustering Algorithm: Example


Repeat Step 4: recalculate the positions of the three
centroids.

Stop
6 dig
I
36

K-means Clustering Algorithm: Example


The first two centroids have moved a little, but the
third has not moved at all.
We assign the 16 objects to clusters once again, as
below.
37

K-means Clustering Algorithm: Example


These are the same clusters as before.
Their centroids will be the same as
those from which the clusters were
generated.
Hence the termination condition of the k-
means algorithm ‘repeat ... until the
centroids no longer move’ has been
met and these are the final clusters wie [Link] 11 I
produced by the algorithm.
38

Limitations of K-means
• K-means has problems when clusters are of
differing
supporise
• Sizes Classfication
Chestring unsuppernised
• Densities
• Non-round shapes

• K-means has problems when the data contains


outliers. I
1 Éo
datacleaningveryimportanttoavoidhis
problems
39

End of the Chapter

Thanks

You might also like