0% found this document useful (0 votes)
4 views63 pages

Odule Lustering Nalysis

Module 4 covers clustering analysis, including its definition, applications, and desired features. It discusses various clustering methods such as k-means, hierarchical clustering, and density-based clustering, along with distance metrics like Euclidean and Manhattan distances. The module also highlights the advantages and disadvantages of these methods, particularly focusing on the k-means algorithm and hierarchical clustering.

Uploaded by

veeba
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views63 pages

Odule Lustering Nalysis

Module 4 covers clustering analysis, including its definition, applications, and desired features. It discusses various clustering methods such as k-means, hierarchical clustering, and density-based clustering, along with distance metrics like Euclidean and Manhattan distances. The module also highlights the advantages and disadvantages of these methods, particularly focusing on the k-means algorithm and hierarchical clustering.

Uploaded by

veeba
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

MODULE 4: CLUSTERING ANALYSIS

OUTLINE
 Introduction to clustering
 Applications of Cluster Analysis
 Desired Features of Clustering
 Distance Metrics
 Major Clustering Methods/Algorithms
 Partitioning Clustering: k-means clustering
 Hierarchical Clustering Algorithms (HCA):
 Agglomerative clustering
 Divisive clustering
 Density-based clustering
 DBSCAN algorithm.
 Implementation of Clustering methods
INTRODUCTION TO CLUSTER ANALYSIS
 Clustering is defined as grouping a set of similar objects into
classes or clusters.
 In other words, during cluster analysis, the data is grouped into
classes or clusters, so that records within a cluster (intra-cluster)
have high similarity with one another but have high
dissimilarities in comparison to objects in other clusters (inter-
cluster).
CLUSTER ANALYSIS
APPLICATIONS OF CLUSTER ANALYSIS
 Marketing: It helps companies group customers with similar
behavior to improve targeted marketing and promotions.
 Land use: is used for identifying areas of similar land (agricultural,
forest, industrial, residential etc.).
 Insurance: is helpful for Insurance companies group customers
based on claim patterns.
 City-planning: It helps to group houses based on type, location,
and value (Luxury villas, middle class apartments, low cost housing
etc.).
 Earthquake studies: helps to identify areas where earthquakes
happen frequently.
 Biology studies: helps to group plants, animals, or genes with
similar features.
 Web discovery: groups similar online documents or websites.
 Fraud detection: helps find unusual behavior (outliers).
DESIRED FEATURES OF CLUSTERING
 Scalability: should work well for both small and large amounts of
data.
 Ability to handle different types of attributes: Should work with
numerical, categories and binary data.
 Independent of data input order: should not dependent on the
ordering of input data.
 Identification of clusters with different shapes: should be capable of
identifying clusters of any shape.
 Ability to handle noisy data: should work properly even if the data
has errors, missing values, or unwanted information
 High performance: It should work fast and not use too much
memory or processing power.
 Interpretability: results should be easy to understand and useful for
decision-making.
 Ability to stop and resume: For very large data, the process should
allow pausing and continuing later.
 Minimal user guidance: should not require too many instructions or
expert supervision to work properly.
REAL TIME APPLICATIONS OF CLUSTER
ANALYSIS:
 It is widely used in image processing, data analysis and
pattern recognition.
 It helps marketers to find the distinct groups in their customer
base and they can characterize their customer groups by using
purchasing patterns.
 It can be used in the field of biology, by deriving animal and
plant taxonomies and identifying genes with the same
capabilities.
 It also helps in information discovery by classifying documents
on the web
DISTANCE METRICS
• A distance metric is a function d(x, y) that specifies the distance
between elements of a set as a non-negative real number.

• Two elements are equal under a particular metric if the distance


between them is zero.

• Distance functions present a method to measure the closeness of


two elements.

• Here, elements can be matrices, vectors or arbitrary objects and


do not necessarily need to be numbered.
Euclidean distance
 Euclidean distance is mainly used to calculate distances.
 The distance between two points in the plane with coordinates
(x, y) and (a, b) according to the Euclidean distance formula is
given by:

For example, the (Euclidean) distance between points


(-2, 2) and (2, -1) is calculated as
Euclidean dist((-2, 2), (2, - 1)) = √(-2 - (2))2 + (2 - (-1))2
= √(-4)2 + (3)2
= √16 + 9
= √25
=5
Distance Metrics

Data to calculate Euclidean distances among three persons

The calculation for the distance between person 1 and 2 is:???


The calculation for the distance between person 1 and 3is:???
The calculation for the distance between person 2 and 3 is:???
Distance Metrics
Manhattan distance
 Manhattan distance is also called L1-distance. It is defined as
the sum of the lengths of the projections of the line segment
between the two points on the coordinate axes.
 For example, the distance between two points in the plane with
coordinates (x, y) and (a, b) according to the Manhattan distance
formula, is given by:

Manhattan dist((x, y), (a, b)) = | x – a | + | y – b|


Distance Metrics
MAJOR CLUSTERING
METHODS/ALGORITHMS
 Clustering methods/algorithms can be categorized into five
categories which are given as follows:
• Partitioning method: It constructs random partitions and then
iteratively refines them by some criterion.
• Hierarchical method: It creates a hierarchical decomposition
of the set of data (or objects) using some criterion.
• Density-based method: It is based on connectivity and density
functions.
• Grid based method: It is based on a multiple-level granularity
structure.
• Model based method: A model is considered for each of the
clusters and the idea is to identify the best fit for that model.
Cluster
Analysis

Hierarchical Density based


Partitioning Grid based Model based
Methods Methods

Divisive
K-Means Agglomerative
1. PARTITIONING METHOD
 Clustering is the task of splitting a group of data or dataset into a
small number of clusters.

 For example, the items in a grocery store are grouped into


different categories (butter, milk, and cheese are clustered in dairy
products).

 This is a qualitative kind of partitioning.


 A quantitative approach on the other hand, measures certain
features of the products such as the percentage of milk and i.e.,
products having a high percentage of milk would be clustered
together.

 The k-means clustering is an important technique which falls


under partitioning clustering.
K-MEANS CLUSTERING ALGORITHM
 It is an unsupervised learning algorithm that is used to solve the
clustering problems in ML or data science.
 It groups the unlabeled dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and
for K=3, there will be three clusters, and so on.
 It is an iterative algorithm that divides the unlabeled dataset
into k different clusters in such a way that each dataset belongs
only one group that has similar properties.
 It is a centroid-based algorithm, where each cluster is associated
with a centroid.
 The main aim of this algorithm is to minimize the sum of
distances between the data point and their corresponding
clusters.
STEPS OF K-MEANS CLUSTERING
ALGORITHM

1. Select k points at random as centroids/cluster centers.


2. Assign data points to the closest cluster based on
Euclidean distance
3. Calculate centroid of all points within the cluster
4. Repeat iteratively till convergence. (Same points are
assigned to the clusters in consecutive iterations)
k-means clustering

Flow chart for K-means Clustering


k-means clustering

It is a centroid-based algorithm, where each cluster is associated with a


centroid. The main aim of this algorithm is to minimize the sum of
distances between the data point and their corresponding clusters.
k-means clustering
k-means clustering
k-means clustering
Elbow method to Choose the value of K
• The performance of the K-means clustering algorithm depends upon
highly efficient clusters that it forms.
• But choosing the optimal number of clusters is a big task.
• The Elbow method is one of the most popular ways to find the
optimal number of clusters.
K-MEANS CLUSTERING
Step 1: take the mean value(random)
Step 2: Find nearest number of mean and place it in the
cluster
Step 3:Repeat Step 1 and 2 till you get same mean.

K={2,3,4,10,11,12,20,25,30}

K=2

Take 2 mean values(random)

m1=4 m2=12
K1={2,3,4} k2={10,11,12,20,25,30}
m1=3 m2=18
K1={2,3,4,10}k2={11,12,20,25,30}

m1=4.75 m2=19.6
K1={2,3,4,10,11,12} k2={20,25,30}

m1=7 m2=25
K1={2,3,4,10,11,12} k2={20,25,30}

m1=7 m2=25
Thus, we are getting same mean, stop…..
K means Clustering
Example-2
Step-1

Database after Initialization


K means Clustering Example-2

Step-1

Database after First Iteration


K means Clustering Example-2

Database after Second Iteration


K means Clustering Example-2

The clusters obtained are: {1,2} and {3,4,5,6,7}


At this iteration, there is no further change in
the clusters. Thus, the algorithm comes to a
halt
here and final result consists of 2 clusters, {1,2}
and {3,4,5,6,7} respectively.

Database after third Iteration


ADVANTAGES
 It is very easy to understand and implement.
 It works quickly, especially for large datasets.
 It performs very well when data is numeric.
 It can handle large amounts of data efficiently.
 It requires less memory and computational power compared to
many other clustering methods.
 Widely Used in Real time Applications (image segmentation,
market segmentation, and pattern recognition).
 Dense clusters are formed with K-means as compared to
Hierarchical clustering.
DISADVANTAGES
 It is difficult to predict the number of clusters (value of k).
 Output is strongly impacted by initial inputs like number of
clusters (value of k)
 Order of data will have strong impact on the final output.
 It is very sensitive to rescaling. If we will rescale our data by
means of normalization or standardization, then the output
will completely change.
 It is not good in doing clustering job if the clusters have a
complicated geometric shape.
 Works Only with Numerical Data (cannot handle categorical
data).
 Not Suitable for Non-Spherical Clusters
APPLICATIONS OF K-MEANS CLUSTERING
ALGORITHM
 Market segmentation: Grouping similar products based on
customer buying patterns.
 Document Clustering:Organizing similar documents or
articles into groups
 Customer segmentation: Grouping customers based on
purchasing behavior for targeted marketing
 Image segmentation: Dividing an image into different
regions based on color or intensity.
 Pattern Recognition: Used in speech recognition and
handwriting recognition systems.
 Data Compression: Reducing data size by grouping similar
data points together
ISSUES WITH K MEANS CLUSTERING
 The results of the k-means algorithm intensively depends on the initial
guesses of the seed records.
 It is sensitive to outliers. Thus, it may give poor results if an outlier is
selected as a starting seed record.
 It works on the basis of Euclidean distance and the mean(s) of the
attribute values of intra-cluster objects. Thus, it is only suitable for
continuous data.
 It does not take into account the size of the clusters.
 Clusters may be either large or small.
• It does not deal with overlapping clusters.
• It does not work well with clusters of different sizes and density.
HIERARCHICAL CLUSTERING
Hierarchical Clustering
 it is also known as hierarchical cluster analysis, is an
algorithm that groups similar objects into groups called
clusters.
 The endpoint is a set of clusters, where each cluster is
distinct from each other cluster, and the objects within each
cluster are broadly similar to each other.
TYPES OF HIERARCHICAL CLUSTERING
1. Agglomerative clustering uses a bottom-up approach,
wherein each data point starts in its own cluster.
 These clusters are then joined greedily, by taking the two most
similar clusters together and merging them.
2. Divisive clustering uses a top-down approach, wherein all
data points start in the same cluster.
 For each cluster, you further divide it down to two clusters
until you hit the desired number of clusters.
HIERARCHICAL CLUSTERING
DENDROGRAM IN HIERARCHICAL
CLUSTERING

 The dendrogram is a tree-like structure that is mainly used


to store each step as a memory that the HC algorithm
performs.
 In the dendrogram plot, the Y-axis shows the Euclidean
distances between the data points, and the x-axis shows all
the data points of the given dataset.
HIERARCHICAL CLUSTERING

Role of linkage metrics


•It is required to compute the distance between clusters by using some
metrics before any clustering is performed. These metrics are called linkage
metrics.
•It will be beneficial to identify the proximity matrix consisting of the
distance between each point using a distance function.

Types of linkage metrics

1. Single linkage
In this , the shortest distance
between two points in each cluster is
considered as the distance between
two clusters.
HIERARCHICAL CLUSTERING
2. Complete linkage
In this, the longest distance
between two points in each cluster
is considered as the distance
between two clusters.

3. Average linkage
In this, the average distance
between each point in one
cluster to every point in the
other cluster is considered as
the distance between two
clusters.
 Let’s solve the problem by hand using both the types of
agglomerative hierarchical clustering :
 Single Linkage : In this, we merge in each step the two
clusters, whose two closest members have the smallest distance.

Using single linkage two clusters are formed :


Cluster 1 : (7,10)
Cluster 2 : (20,28,35)
2. Complete Linkage : In this, we merge in the members of the
clusters in each step, which provide the smallest maximum
pairwise distance.

Using complete linkage two clusters are formed :


Cluster 1 : (7,10,20)
Cluster 2 : (28,35)
ADVANTAGES OF HIERARCHICAL
CLUSTERING

 The ability to handle clusters of different sizes and densities.


 The ability to handle missing data and noisy data.
 No need to specify the number of clusters in advance
 It creates a tree-like structure called a Dendrogram, which
helps visualize relationships among data points.
 Easy to understand and interpret
 Works well for small datasets
 Different distance measures such as Euclidean Distance or
Manhattan Distance can be used.
 Once clusters are formed, the algorithm does not need
repeated adjustments like some partitioning algorithms.
DRAWBACKS OF HIERARCHICAL
CLUSTERING
 High computational cost: requires large computation time
and memory, so it is not suitable for very large datasets.
 Sensitive to noise and outliers
 Irreversible merging or splitting: Once clusters are merged
or split during the process, they cannot be changed later.
 Difficult to determine the optimal number of clusters
 Poor scalability: The algorithm becomes inefficient as the
dataset size increases compared to methods like K-Means
Clustering.
 Choice of distance measure affects results: Different
measures such as Euclidean Distance or Manhattan Distance
may produce different clustering outcomes.
 Storage complexity: It requires storing a distance matrix,
which consumes large memory for big datasets.
APPLICATIONS OF HIERARCHICAL
CLUSTERING
 Biology and Genetics: Used to classify plants, animals, and
genes based on similarities. It is widely applied in
Bioinformatics for gene expression analysis.
 Document Classification: Helps group similar documents or
web pages for better organization and information retrieval.
 Market Research and Customer Segmentation:
Businesses use it to group customers based on purchasing
behavior and preferences.
 Image Processing: Used to group similar pixels or image
features for object recognition and pattern detection.
 Social Network Analysis: Helps identify communities or
groups of people with similar interactions.
 Medical Research: Used to group patients with similar
symptoms or diseases for better diagnosis and treatment
analysis.
 Data Mining and Pattern Recognition Helps discover
hidden patterns and relationships in large datasets.
DENSITY-BASED SPATIAL CLUSTERING OF
APPLICATIONS WITH NOISE (DBSCAN)
 It is based on this intuitive notion of “clusters” and “noise”.
 The key idea is that for each point of a cluster, the
neighborhood of a given radius has to contain at least a
minimum number of points.

Why DBSCAN?
 Partitioning methods and hierarchical clustering work for
finding spherical-shaped clusters.
 Moreover, they are also severely affected by the presence of
noise and outliers in the data.
PARAMETERS REQUIRED FOR DBSCAN
ALGORITHM
 eps: It defines the neighborhood around a data point i.e. if the
distance between two points is lower or equal to ‘eps’ then they are
considered neighbors.
 MinPts: Minimum number of neighbors (data points) within eps
radius.
 In this algorithm, we have 3 types of data points.
1. Core Point: A point is a core point if it has more than MinPts
points within eps.
2. Border Point: A point which has fewer than MinPts within eps
but it is in the neighborhood of a core point.
3. Noise or outlier: A point which is not a core point or border point
STEPS USED IN DBSCAN ALGORITHM
1. Find all the neighbor points within eps and identify the core
points or visited with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster,
create a new cluster.
3. Find recursively all its density-connected points and assign
them to the same cluster as the core point.

 Iterate through the remaining unvisited points in the dataset.


Those points that do not belong to any cluster are noise.
ADVANTAGES
 It can find clusters with any shape.
 No need to specify number of clusters: Unlike K-Means
Clustering, it automatically determines clusters based on
density.
 Detects noise and outliers: It can identify noisy data points
that do not belong to any cluster.
 Works well with spatial data: It is effective for datasets where
clusters are formed based on density distribution
 Less sensitive to initialization: The algorithm does not
depend heavily on initial starting points.
DISADVANTAGES
 Difficulty in choosing parameters: Selecting appropriate
values for Epsilon (ε) and MinPts can be challenging.
 Poor performance with varying densities: it struggles
when clusters have different densities.
 High computational cost for large datasets: It may become
slow for very large datasets.
 Sensitive to distance measurement: The clustering result
depends on the distance metric used.
APPLICATIONS
 Spatial data analysis: Used in geographic and
environmental studies to detect clusters of locations.
 Image processing: Helps identify objects and regions in
images.
 Anomaly detection: Used to detect unusual patterns or
outliers in datasets.
 Market analysis: Helps group customers with similar
purchasing patterns.
 Social network analysis: Used to detect communities or
groups within large networks.
 Medical data analysis: Helps cluster patients with similar
symptoms or disease patterns.
DBSCAN VS K- MEANS CLUSTERING
ALGORITHMS

DBSCAN K-Means

• We need not specify the number • It is very sensitive to number of


of clusters. clusters so it need to specified

• Clusters formed can be of any • Clusters formed are spherical


arbitrary shape. or convex in shape

• Can work well with datasets • Does not work well with outliers
having noise and outliers data.

• Two parameters are required for • only one parameter is required is


training the Model for training the model
Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7,
3), F(6, 2), G(7, 2) and H(8, 4), Find the core points and
outliers using DBSCAN. Take Eps = 2.5 and MinPts =
3
 Given, Epsilon(Eps) = 2.5
Minimum Points(MinPts) = 3
Let’s represent the given data points in tabular form:
Step 1: To find the core points, outliers and clusters by using DBSCAN we
need to first calculate the distance among all pairs of given data point. Let us
use Euclidean distance measure for distance calculation.
The final distance matrix becomes as shown below:

Step 2: Now, finding all the data points that lie in the Eps-neighborhood of
each data points. That is, put all the points in the neighborhood set of each
data point whose distance is <=2.5.

(A) = {B}; — — — — — — -→ because distance of B is <= 2.5 with A


(B) = {A, C}; — — — — — → because distance of A and C is <= 2.5 with B
(C) = {B, D}; — — — — —→ because distance of B and D is <=2.5 with C
(D) = {C, E, F, G, H}; — → because distance of C, E, F,G and H is <=2.5 with D
(E) = {D, F, G, H}; — — → because distance of D, F, G and H is <=2.5 with E
(F) = {D, E, G}; — — — — → because distance of D, E and G is <=2.5 with F
(G) = {D, E, F, H}; — — -→ because distance of D, E, F and H is <=2.5 with G
(H) = {D, E, G}; — — — — → because distance of D, E and G is <=2.5 with H
Here, data points A, B and C have neighbors <= MinPts (i.e. 3) so can’t be
considered as core points. Since they belong to the neighborhood of other
data points, hence there exist no outliers in the given set of data points.

Data points D, E, F, G and H have neighbors >= MinPts (i.e. 3) and hence
are the core data points.

Point Status
CORE NOISE BORDER

A √
B √
C √
D √
E √
F √
G √
H √
EXAMPLE 2
P1 0.00
Eps=1.9
P2 1.41 0.00
MinPts=4
P3 2.83 1.41 0.00

P4 4.24 2.83 1.41 0.00

P5 5.66 4.24 2.83 1.41 0.00

P6 5.83 4.47 3.16 2.00 1.41 0.00

P7 6.40 5.00 3.61 2.24 1.00 1.00 0.00

P8 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0.00

P9 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0.00

P10 1.41 2.00 3.16 4.47 5.83 5.66 6.40 6.32 3.16 0.00

P11 2.00 1.41 2.00 3.16 4.47 4.24 5.00 5.10 2.00 1.41 0.00

P12 3.16 2.83 3.16 4.00 5.10 4.47 5.39 6.00 1.41 2.00 1.41 0.00

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4,P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:P5

P9:P12 P10:P1, P11 P11: P2, P10,P12 P12:P9, P11


Point Status

Noise Border
MinPts=4 P1
Core
P2
Noise Border
P3
Noise Border
P4

Core
P5
12
Noise Border
P6
3 Noise Border
P7
2 Noise Border
P8

1 Noise
4
P9
1 2
Noise Border
P10
Core
P11
Noise Border
P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4, P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:PS

P9:P12 P10:P1, P11 P11: P2,P10, P12 P12:P9, P11


Point X Y

3 7
P1
4 6
P2
5 5
P3 7

6 4
P4 6

7 3
P5 5
6 2
P6
7 2
P7 3

8 4
P8
2
3 3
P9
1
2 6
P10 1 2 4 5

3 5
P11
2 4
P12
Eps=1.9 MinPts=4
P1 0.00
Eps=1.9
P2 1.41 0.00
MinPts=4
P3 2.83 1.41 0.00

P4 4.24 2.83 1.41 0.00

P5 5.66 4.24 2.83 1.41 0.00

P6 5.83 4.47 3.16 2.00 1.41 0.00

P7 6.40 5.00 3.61 2.24 1.00 1.00 0.00

P8 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0.00

P9 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0.00

P10 1.41 2.00 3.16 4.47 5.83 5.66 6.40 6.32 3.16 0.00

P11 2.00 1.41 2.00 3.16 4.47 4.24 5.00 5.10 2.00 1.41 0.00

P12 3.16 2.83 3.16 4.00 5.10 4.47 5.39 6.00 1.41 2.00 1.41 0.00

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4,P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:P5

P9:P12 P10:P1, P11 P11: P2, P10,P12 P12:P9, P11


Point Status

Noise Border
MinPts=4 P1
Core
P2
Noise Border
P3
Noise Border
P4

Core
P5
12
Noise Border
P6
3 Noise Border
P7
2 Noise Border
P8

1 Noise
4
P9
1 2
Noise Border
P10
Core
P11
Noise Border
P12

P1:P2,P10 P2:P1, P3, P11 P3:P2, P4 P4:P3, P5

PS:P4, P6,P7,P8 P6:P5, P7 P7:P5, P6 P8:PS

P9:P12 P10:P1, P11 P11: P2,P10, P12 P12:P9, P11

You might also like