Cluster Analysis
Cluster analysis is also known as clustering, which groups similar data points forming
clusters. The goal is to ensure that data points within a cluster are more similar to each other
than to those in other clusters. For example, in e-commerce retailers use clustering to group
customers based on their purchasing habits. If one group frequently buys fitness gear while
another prefers electronics. This helps companies to give personalized recommendations
and improve customer experience.
Distance Metrics
Distance metrics are simple mathematical formulas to figure out how similar or different two
data points are. Type of distance metrics we choose plays a big role in deciding clustering
results. Some of the common metrics are:
Euclidean Distance: It is the most widely used distance metric and finds the straight-
line distance between two points.
Manhattan Distance: It measures the distance between two points based on grid-like
path. It adds the absolute differences between the values.
Cosine Similarity: This method checks the angle between two points instead of
looking at the distance. It’s used in text data to see how similar two documents are.
Jaccard Index: A statistical tool used for comparing the similarity of sample sets. It’s
mostly used for yes/no type data or categories.
Types of Clustering Techniques
1. Partitioning Methods
Partitioning Methods divide the data into k groups (clusters) where each data point
belongs to only one group. These methods are used when you already know how
many clusters you want to create. A common example is K-means clustering.
In K-means the algorithm assigns each data point to the nearest center and then
updates the center based on the average of all points in that group. This process
repeats until the centres stop changing. It is used in real-life applications like
streaming platforms like Spotify to group users based on their listening habits.
2. Hierarchical Methods
Hierarchical clustering builds a tree-like structure of clusters known as a dendrogram that
represents the merging or splitting of clusters. It can be divided into:
Agglomerative Approach (Bottom-up): Agglomerative Approach starts with
individual points and merges similar ones. Like a family tree where relatives are
grouped step by step.
Divisive Approach (Top-down): It starts with one big cluster and splits it repeatedly
into smaller clusters. For example, classifying animals into broad categories like
mammals, reptiles, etc and further refining them.
3. Density-Based Methods
Density-based clustering group data points that are densely packed together and
treat regions with fewer data points as noise or outliers. This method is particularly
useful when clusters are irregular in shape.
For example, it can be used in fraud detection as it identifies unusual patterns of
activity by grouping similar behaviors together.
4. Grid-Based Methods
Grid-Based Methods divide data space into grids making clustering efficient. This
makes the clustering process faster because it reduces the complexity by limiting the
number of calculations needed and is useful for large datasets.
Climate researchers often use grid-based methods to analyze temperature variations
across different geographical regions. By dividing the area into grids they can more
easily identify temperature patterns and trends.
5. Model-Based Methods
Model-based clustering groups data by assuming it comes from a mix of
distributions. Gaussian Mixture Models (GMM) are commonly used and assume the
data is formed by several overlapping normal distributions.
GMM is commonly used in voice recognition systems as it helps to distinguish
different speakers by modeling each speaker’s voice as a Gaussian distribution.
Applications of Cluster Analysis
Market Segmentation: This is used to segment customers based on purchasing
behavior and allow businesses send the right offers to the right people.
Image Segmentation: In computer vision it can be used to group pixels in an image
to detect objects like faces, cars or animals.
Biological Classification: Scientists use clustering to group genes with similar
behaviors to understand diseases and treatments.
Document Classification: It is used by search engines to categorize web pages for
better search results.
Anomaly Detection: Cluster Analysis is used for outlier detection to identify rare data
points that do not belong to any cluster.
Challenges in Cluster Analysis
While clustering is very useful for analysis it faces several challenges:
Choosing the Number of Clusters: Methods like K-means requires user to specify the
number of clusters before starting which can be difficult to guess correctly.
Scalability: Some algorithms like hierarchical clustering does not scale well with large
datasets.
Cluster Shape: Many algorithms assume clusters are round or evenly shaped which
doesn’t always match real-world data.
Handling Noise and Outliers: They are sensitive to noise and outliers which can
affect the results.
Partitioning Method (K-Mean) in Data Mining
This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. Its the data analysts to specify the number of
clusters that has to be generated for the clustering methods. In the partitioning method
when database(D) that contains multiple(N) objects then the partitioning method constructs
user-specified(K) partitions of the data in which each partition represents a cluster and a
particular region. There are many algorithms that come under partitioning method some of
the popular ones are K-Mean, PAM(K-Medoids), CLARA algorithm (Clustering Large
Applications) etc. In this article, we will be seeing the working of K Mean algorithm in detail.
K-Mean (A centroid based Technique): The K means algorithm takes the input parameter K
from the user and partitions the dataset containing N objects into K clusters so that resulting
similarity among the data objects inside the group (intracluster) is high but the similarity of
data objects with the data objects from outside the cluster is low (intercluster). The
similarity of the cluster is determined with respect to the mean value of the cluster. It is a
type of square error algorithm. At the start randomly k objects from the dataset are chosen
in which each of the objects represents a cluster mean(centre). For the rest of the data
objects, they are assigned to the nearest cluster based on their distance from the cluster
mean. The new mean of each of the cluster is then calculated with the added data objects.
Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No change Between Iteration 3 and 4, so we stop. Therefore we get the clusters (16-
29) and (36-66) as 2 clusters we get using K Mean Algorithm.
Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis in data mining that creates a
hierarchical representation of the clusters in a dataset. The method starts by treating each
data point as a separate cluster and then iteratively combines the closest clusters until a
stopping criterion is reached. The result of hierarchical clustering is a tree-like structure,
called a dendrogram, which illustrates the hierarchical relationships among the clusters.
Types of Hierarchical Clustering
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Calculate the similarity of one cluster with all the other clusters (calculate proximity
matrix)
Consider every data point as an individual cluster
Merge the clusters which are highly similar or close to each other.
Recalculate the proximity matrix for each cluster
Repeat Steps 3 and 4 until only a single cluster remains.
Divisive Hierarchical clustering
We can say that Divisive Hierarchical clustering is precisely the opposite of Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we take into account all of the data
points as a single cluster and in every iteration, we separate the data points from the
clusters which aren't comparable. In the end, we are left with N clusters.
Advantages
Handle non-convex clusters and clusters of different sizes and densities.
Handle missing data and noisy data.
Drawbacks
The need for a criterion to stop the clustering process and determine the final
number of clusters.
The computational cost and memory requirements of the method can be high,
especially for large datasets.
The results can be sensitive to the initial conditions, linkage criterion, and distance
metric used.
Write example for each method (refer material)
Density based clustering
Density-based clustering is a clustering technique that groups data points based on the
density of data in a region. It is particularly useful for discovering clusters of arbitrary shapes
and for handling noise (outliers) effectively.
Popular Algorithms
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Most commonly used density-based algorithm. Works with two parameters:
1. eps: This defines the radius of the neighborhood around a data point. If the distance
between two points is less than or equal to eps they are considered neighbors. A common
method to determine eps is by analyzing the k-distance graph. Choosing the right eps is
important:
If eps is too small most points will be classified as noise.
If eps is too large clusters may merge and the algorithm may fail to distinguish
between them.
2. MinPts: This is the minimum number of points required within the eps radius to form a
dense region. A general rule of thumb is to set MinPts >= D+1 where D is the number of
dimensions in the dataset. For most cases a minimum value of MinPts = 3 is
recommended.
OPTICS (Ordering Points To Identify Clustering Structure)
An extension of DBSCAN.
Handles clusters of varying densities more effectively.
Advantages
Can identify clusters of any shape.
Automatically detects noise and outliers.
Does not require the number of clusters in advance.
Disadvantages
Choosing suitable parameters (ε and MinPts) is sometimes difficult.
Performance decreases with high-dimensional data.
Applications
Geographical data analysis
Image segmentation
Anomaly detection
Market segmentation
Sensor data analysis