1. Introduction to Clustering
Clustering is a technique used in data analysis and machine learning to group similar objects
together.
Objects in the same cluster are more similar to each other than to objects in other clusters.
Simple Real-Life Example
Suppose a teacher wants to group students based on marks.
Student Marks
A 90
B 88
C 85
D 40
E 45
F 42
Possible clusters:
Cluster 1 → High marks students
Cluster 2 → Low marks students
This grouping of similar students is called clustering.
2. What is Hierarchical Clustering?
Hierarchical clustering is a method of clustering where clusters are formed step-by-step in a
hierarchical structure (like a tree).
Instead of producing a single flat set of clusters, the algorithm builds a tree of clusters.
The diagram of this tree is called a Dendrogram.
Hierarchy means levels.
Example hierarchy in real life:
Living things classification:
Animal
→ Mammal
→ Dog
→ Labrador
This is hierarchical grouping.
3. Types of Hierarchical Clustering
There are two main types of hierarchical clustering:
1. Agglomerative Hierarchical Clustering
• Start with each data point as its own cluster
• Gradually merge clusters together
This is also called the Bottom-Up approach.
2. Divisive Hierarchical Clustering
• Start with all data points in one cluster
• Gradually split clusters
This is called the Top-Down approach.
In this topic we focus on:
Agglomerative Hierarchical Clustering
4. What is Agglomerative Hierarchical Clustering?
Agglomerative Hierarchical Clustering is a clustering technique where:
• Each data point starts as a separate cluster
• The closest clusters are merged step by step
• This continues until all data points form one cluster
The word Agglomerative means:
“Gradually joining or merging things together.”
5. Real-Life Example
Imagine 6 students standing separately in a playground.
Step 1
Each student stands alone.
Step 2
Students with similar marks stand together.
Step 3
Small groups join to form bigger groups.
Step 4
Eventually all students form one large group.
This process is similar to agglomerative clustering.
6. Steps of Agglomerative Hierarchical Clustering
The algorithm works in the following steps.
Step 1: Start with Individual Clusters
Each data point forms its own cluster.
Example data:
Student Marks
A 10
B 12
C 25
D 27
Initial clusters:
Cluster1 → A
Cluster2 → B
Cluster3 → C
Cluster4 → D
Step 2: Calculate Distance Between Clusters
The algorithm calculates the distance between every pair of clusters.
Distance tells how similar or different clusters are.
Common distance measure:
Euclidean Distance
Example:
Distance between A(10) and B(12) = 2
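As a minimal sketch of Step 2, the snippet below computes the distance for every pair of students in the example data. The names `marks`, `euclidean_1d` and `pairwise` are illustrative choices, not part of the notes. With a single feature, Euclidean distance reduces to the absolute difference.

```python
from itertools import combinations

# Example data from Step 1.
marks = {"A": 10, "B": 12, "C": 25, "D": 27}

def euclidean_1d(x, y):
    """Euclidean distance between two one-dimensional values."""
    return abs(x - y)

# Distance for every pair of students (each pair listed once).
pairwise = {
    (p, q): euclidean_1d(marks[p], marks[q])
    for p, q in combinations(marks, 2)
}

for pair, dist in pairwise.items():
    print(pair, dist)
```

The pairs (A, B) and (C, D) both come out at distance 2, which is why they are the first candidates for merging.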
Step 3: Merge the Closest Clusters
The two clusters with the smallest distance are merged.
Example:
Distance between A and B = 2
Distance between C and D = 2
Both pairs are tied at the smallest distance, so either can be merged first. Here A and B are merged first.
New clusters:
Cluster1 → A, B
Cluster2 → C
Cluster3 → D
Step 4: Recalculate Distance
Now distances are calculated between new clusters.
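Example:
Cluster1 (A,B)
Cluster2 (C)
Cluster3 (D)
The distance between C and D is still 2. Every point in Cluster1 is at least 13 away from C and D, so C and D are now the closest pair.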
Step 5: Repeat the Process
Again merge the closest clusters.
Eventually we get:
Cluster1 → A, B
Cluster2 → C, D
Then finally:
Cluster → A, B, C, D
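The five steps above can be sketched in a few lines of plain Python. This is an illustration only: it assumes single linkage (distance between the closest points of two clusters) as the rule for Step 4, and it breaks ties by taking the earliest pair. The function names `single_link` and `agglomerate` are made up for this example.

```python
def single_link(c1, c2, values):
    """Smallest distance between any point of c1 and any point of c2."""
    return min(abs(values[p] - values[q]) for p in c1 for q in c2)

def agglomerate(values, linkage=single_link):
    """Merge the two closest clusters until one remains; return the merge log."""
    clusters = [(name,) for name in values]   # Step 1: every point alone
    history = []
    while len(clusters) > 1:
        # Steps 2 and 4: distance between every pair of current clusters.
        candidates = [
            (linkage(clusters[i], clusters[j], values), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        # Step 3: the smallest distance wins; ties go to the earliest pair.
        dist, i, j = min(candidates)
        merged = clusters[i] + clusters[j]
        history.append((clusters[i], clusters[j], dist))
        # Step 5: replace the two merged clusters with the new one and repeat.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

marks = {"A": 10, "B": 12, "C": 25, "D": 27}
history = agglomerate(marks)
for left, right, dist in history:
    print(left, "+", right, "at distance", dist)
```

On the example data this merges A with B, then C with D, and finally joins the two pairs at distance 13 (the gap between B and C).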
7. Dendrogram (Tree Diagram)
The clustering process is often shown using a dendrogram.
A dendrogram is a tree-like diagram that shows:
• How clusters merge
• At what distance clusters combine
Example structure:
        ______|______
       |             |
     __|__         __|__
    |     |       |     |
    A     B       C     D
The height of the branches shows the distance between clusters.
By cutting the dendrogram at a certain height, we can decide the number of clusters.
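Cutting the dendrogram can be sketched as follows. The merge heights below are an assumption: they are what single linkage gives for the marks A=10, B=12, C=25, D=27. The helper `cut` is a hypothetical name. It keeps only the merges that happen at or below the chosen height.

```python
# Merge history for the marks A=10, B=12, C=25, D=27, assuming single linkage:
# (cluster, cluster, height at which the two were joined).
merges = [
    (("A",), ("B",), 2),
    (("C",), ("D",), 2),
    (("A", "B"), ("C", "D"), 13),
]

def cut(merges, points, height):
    """Return the clusters that exist when the tree is cut at `height`."""
    clusters = {frozenset([p]) for p in points}
    for left, right, dist in merges:
        if dist <= height:            # this merge happens below the cut
            clusters -= {frozenset(left), frozenset(right)}
            clusters.add(frozenset(left) | frozenset(right))
    return sorted(sorted(c) for c in clusters)

print(cut(merges, "ABCD", height=1))    # below every merge: 4 clusters
print(cut(merges, "ABCD", height=5))    # between 2 and 13: 2 clusters
print(cut(merges, "ABCD", height=20))   # above every merge: 1 cluster
```

A low cut gives many small clusters, and a high cut gives few large ones.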
8. Linkage Methods in Agglomerative Clustering
When merging clusters, we must decide how to measure distance between clusters.
Different methods are called linkage methods.
1. Single Linkage
Distance between the closest points of two clusters.
Example:
Cluster1 → {A,B}
Cluster2 → {C,D}
Distance = minimum distance between any two points, one taken from each cluster.
2. Complete Linkage
Distance between the farthest points of two clusters.
Example:
Distance = maximum distance between any two points, one taken from each cluster.
3. Average Linkage
Distance = average of all pairwise distances between clusters.
4. Centroid Linkage
Distance between cluster centers (means).
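To compare the four linkage methods side by side, here is a small sketch that applies each of them to Cluster1 = {A, B} and Cluster2 = {C, D}. It assumes the marks from the earlier example (A=10, B=12, C=25, D=27). The function names are illustrative.

```python
from statistics import mean

marks = {"A": 10, "B": 12, "C": 25, "D": 27}
cluster1, cluster2 = ["A", "B"], ["C", "D"]

def pair_distances(c1, c2):
    """All distances between a point of c1 and a point of c2."""
    return [abs(marks[p] - marks[q]) for p in c1 for q in c2]

def single(c1, c2):
    """Single linkage: closest pair of points."""
    return min(pair_distances(c1, c2))

def complete(c1, c2):
    """Complete linkage: farthest pair of points."""
    return max(pair_distances(c1, c2))

def average(c1, c2):
    """Average linkage: mean of all cross-cluster distances."""
    return mean(pair_distances(c1, c2))

def centroid(c1, c2):
    """Centroid linkage: distance between the two cluster means."""
    return abs(mean(marks[p] for p in c1) - mean(marks[q] for q in c2))

for method in (single, complete, average, centroid):
    print(method.__name__, method(cluster1, cluster2))
```

For this data, single linkage gives 13, complete linkage gives 17, and both average and centroid linkage give 15. On less symmetric data the four values usually differ more, which can change the order in which clusters merge.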
9. Simple Real World Applications
1. Biology
Scientists group animals or plants based on genetic similarity.
Example hierarchy:
Animal
→ Mammal
→ Dog
→ Labrador
2. Document Organization
Search engines group documents:
Cluster 1 → Sports articles
Cluster 2 → Political articles
Cluster 3 → Technology articles
3. Customer Segmentation
Companies group customers based on:
• Age
• Spending behavior
• Interests
4. Image Processing
Pixels with similar colors are grouped together.
Example:
Cluster 1 → Sky pixels
Cluster 2 → Tree pixels
Cluster 3 → Building pixels
10. Advantages of Agglomerative Clustering
✔ Easy to understand
✔ Does not require the number of clusters to be chosen in advance
✔ Produces hierarchical structure
✔ Useful for small datasets
11. Limitations
✖ Computationally expensive for large datasets (a basic implementation needs O(n²) memory for the pairwise distances and about O(n³) time)
✖ Once clusters merge, they cannot be split again
✖ Sensitive to noise and outliers
12. Summary
Concept                   Meaning
Hierarchical Clustering   Clustering with a tree structure
Agglomerative             Bottom-up clustering method
Start                     Each point is its own cluster
Process                   Merge closest clusters step by step
End                       All points form one cluster
Visualization             Dendrogram
Final Simple Definition
Agglomerative Hierarchical Clustering is a bottom-up clustering method in which each data point
starts as an individual cluster and the closest clusters are repeatedly merged until all points belong
to one cluster.
Cluster Validation
1. First Understand: What is Clustering?
Clustering is a technique in data analysis where we group similar data together.
• Objects inside the same cluster are similar.
• Objects in different clusters are different.
Real-World Example
Imagine a school where students are grouped based on marks.
Student Marks
A 85
B 82
C 30
D 28
Clusters may be:
• Cluster 1: A, B → High marks
• Cluster 2: C, D → Low marks
Here students with similar marks are grouped together.
2. What is Cluster Validation?
After creating clusters, we must check whether the clusters are correct and meaningful.
This checking process is called Cluster Validation.
Simple Definition
Cluster Validation is the process of evaluating how good the clusters are.
It answers questions like:
• Are objects inside clusters really similar?
• Are clusters clearly different from each other?
• Is the clustering meaningful or random?
3. Why Cluster Validation is Important
Sometimes clustering algorithms create wrong or meaningless groups.
Example:
Cluster 1 → A, C
Cluster 2 → B, D
But marks are:
• A = 85
• B = 82
• C = 30
• D = 28
Clearly this grouping is not logical.
Cluster validation helps detect such bad clustering results.
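One simple way to see why the grouping above is illogical is to measure how far apart the members of each cluster are. The sketch below uses the average distance between points that share a cluster. This particular check and the name `mean_within_distance` are choices made for this illustration, not a standard named in the notes.

```python
from itertools import combinations
from statistics import mean

marks = {"A": 85, "B": 82, "C": 30, "D": 28}

def mean_within_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    return mean(
        abs(marks[p] - marks[q])
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )

illogical = [["A", "C"], ["B", "D"]]   # the grouping shown above
sensible = [["A", "B"], ["C", "D"]]    # grouping by similar marks

print(mean_within_distance(illogical))
print(mean_within_distance(sensible))
```

The illogical grouping gives an average within-cluster distance of 54.5 marks, against 2.5 for the sensible grouping. That gap is the kind of signal cluster validation looks for.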
4. Goal of Cluster Validation
A good clustering should have:
1. High similarity within clusters
Objects inside a cluster should be very similar.
Example
Students with similar marks in one group.
2. Low similarity between clusters
Different clusters should be very different.
Example
Top performers and weak students should not be mixed.
5. Types of Cluster Validation
There are three main types:
1. Internal Validation
2. External Validation
3. Relative Validation
6. Internal Validation
Meaning
Internal validation checks cluster quality using only the dataset itself.
It does not require external information.
It measures:
• Distance between points in the same cluster
• Distance between different clusters
Real-Life Example
Suppose a supermarket groups customers based on spending.
Customer Spending
A ₹500
B ₹550
C ₹5000
D ₹5200
Clusters formed:
Cluster 1 → A, B
Cluster 2 → C, D
Customers in each cluster have similar spending, so clustering is good.
Internal validation confirms this.
7. External Validation
Meaning
External validation compares clustering results with known correct labels.
It checks whether clusters match real categories.
Real-Life Example
Suppose emails are categorized into:
• Spam
• Not Spam
If clustering groups emails exactly like these labels, the clustering is accurate.
External validation compares the clusters with true labels.
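As a rough sketch of external validation, the code below compares cluster assignments with true labels using the Rand index: the fraction of point pairs on which the two agree (both put the pair together, or both keep it apart). The Rand index is one possible measure chosen for this example and is not named in the notes. The five emails and their labels are hypothetical.

```python
from itertools import combinations

def rand_index(true_labels, cluster_ids):
    """Fraction of point pairs on which labels and clusters agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_ids[i] == cluster_ids[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical true labels for five emails.
truth = ["spam", "spam", "not spam", "not spam", "spam"]

perfect = [0, 0, 1, 1, 0]     # clusters match the labels exactly
imperfect = [0, 0, 1, 1, 1]   # the last email lands in the wrong cluster

print(rand_index(truth, perfect))
print(rand_index(truth, imperfect))
```

A clustering that matches the labels exactly scores 1.0. The imperfect one scores 0.6 here, because 4 of the 10 pairs involve the misplaced email.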
8. Relative Validation
Meaning
Relative validation compares different clustering results to find the best one.
Example:
We try:
• 2 clusters
• 3 clusters
• 4 clusters
Then validation checks which option gives the best grouping.
Real-Life Example
A company groups customers.
Option 1
• Low spenders
• High spenders
Option 2
• Low spenders
• Medium spenders
• High spenders
Relative validation helps decide which grouping is better.
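Relative validation can be sketched by scoring two candidate groupings of the same data and keeping the better one. The example below uses the student marks from the introduction and the silhouette score described later in these notes. The three-cluster grouping is a hypothetical alternative made up for the comparison, and the sketch assumes a point that is alone in its cluster gets a score of 0, a common convention.

```python
from statistics import mean

marks = {"A": 90, "B": 88, "C": 85, "D": 40, "E": 45, "F": 42}

def silhouette(point, clusters):
    """Silhouette of one point; 0 by convention for a one-point cluster."""
    own = next(c for c in clusters if point in c)
    if len(own) == 1:
        return 0.0
    # a: average distance to the other points in the same cluster.
    a = mean(abs(marks[point] - marks[q]) for q in own if q != point)
    # b: smallest average distance to the points of any other cluster.
    b = min(
        mean(abs(marks[point] - marks[q]) for q in other)
        for other in clusters if other is not own
    )
    return (b - a) / max(a, b)

def mean_silhouette(clusters):
    """Average silhouette over all points: one score per grouping."""
    return mean(silhouette(p, clusters) for p in marks)

two_clusters = [["A", "B", "C"], ["D", "E", "F"]]
three_clusters = [["A", "B"], ["C"], ["D", "E", "F"]]   # hypothetical split

score_two = mean_silhouette(two_clusters)
score_three = mean_silhouette(three_clusters)
print(round(score_two, 2), round(score_three, 2))
```

On this data the two-cluster grouping scores about 0.93 and the three-cluster grouping about 0.62, so the two-cluster option would be preferred.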
9. Common Measures Used in Cluster Validation
Some mathematical measures are used to evaluate clusters.
1. Cohesion
Cohesion measures how close data points are inside a cluster.
Lower distance = better cluster.
Example
Students in a cluster have marks:
80, 81, 82
These values are very close → good cohesion.
2. Separation
Separation measures how far clusters are from each other.
Higher distance = better clusters.
Example
Cluster 1 → marks around 80
Cluster 2 → marks around 30
Clusters are clearly different → good separation.
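Cohesion and separation can each be defined in several ways. The sketch below assumes one simple choice for each: cohesion as the mean pairwise distance inside a cluster, and separation as the distance between the two cluster means. The marks in `cluster2` are hypothetical values "around 30", since the notes give no exact numbers.

```python
from itertools import combinations
from statistics import mean

cluster1 = [80, 81, 82]   # marks from the cohesion example
cluster2 = [29, 30, 31]   # hypothetical marks "around 30"

def cohesion(cluster):
    """Mean pairwise distance inside a cluster (lower is tighter)."""
    return mean(abs(x - y) for x, y in combinations(cluster, 2))

def separation(c1, c2):
    """Distance between the cluster means (higher is further apart)."""
    return abs(mean(c1) - mean(c2))

print(cohesion(cluster1))
print(separation(cluster1, cluster2))
```

Points inside Cluster 1 are on average about 1.33 marks apart, while the two clusters are 51 marks apart: tight clusters that are far from each other.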
10. Silhouette Score (Simple Idea)
Silhouette score is a popular method for cluster validation.
It measures:
• similarity within cluster
• difference between clusters
Score range:
Value        Meaning
Close to 1   Very good clustering
Around 0     Overlapping clusters
Negative     Points are probably assigned to the wrong cluster
11. Practical Example (College Admission)
Suppose a college groups students based on entrance scores.
Student Score
A 92
B 90
C 45
D 40
Clusters:
Cluster 1 → A, B (top students)
Cluster 2 → C, D (low scores)
Validation checks:
• Students in cluster are similar
• Clusters are clearly different
So clustering is valid.
12. Advantages of Cluster Validation
1. Helps check cluster quality
2. Detects incorrect clustering
3. Helps choose best clustering method
4. Improves accuracy of data analysis
13. Summary
Concept               Meaning
Cluster Validation    Checking the quality of clusters
Internal Validation   Uses only dataset
External Validation   Uses true labels
Relative Validation   Compares different clustering results
Cohesion              Similarity inside cluster
Separation            Difference between clusters
✔ In simple words:
Cluster validation helps us check whether the clusters created by a clustering algorithm are correct,
meaningful, and useful.
Silhouette Score – How to Calculate (Step-by-Step with Simple Example)
Silhouette Score measures how well a data point fits in its cluster.
It checks two things:
1. How close the point is to its own cluster
2. How far the point is from other clusters
The formula for silhouette score of a point is:
S = (b - a) / max(a, b)
Where:
• a = average distance from the point to all other points in the same cluster
• b = average distance from the point to the points in the nearest other cluster (the other cluster for which this average is smallest)
Range of silhouette score:
Value        Meaning
Close to 1   Very good clustering
Around 0     Clusters overlap
Negative     The point is probably in the wrong cluster
Step-by-Step Example
Suppose we have 2 clusters of students based on marks.
Cluster 1
A = 80
B = 82
Cluster 2
C = 30
D = 32
We will calculate the silhouette score for student A.
Step 1: Calculate a (distance inside the cluster)
Distance between A and B
|80 - 82| = 2
Since B is the only other point in cluster 1:
a = 2
Step 2: Calculate b (distance to other cluster)
Distances from A to points in Cluster 2
Distance A → C
|80 - 30| = 50
Distance A → D
|80 - 32| = 48
Average distance:
b = (50 + 48) / 2 = 49
Step 3: Apply Silhouette Formula
S = (b – a)/max(a,b)
Substitute values:
S = (49 - 2) / 49
S = 47 / 49
S ≈ 0.96
Step 4: Interpret the Result
Silhouette score ≈ 0.96
This means:
• A is very close to its own cluster
• A is very far from the other cluster
So A is well placed in its cluster. The other three points score about the same, so the clustering as a whole is very good.
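The hand calculation for student A can be checked with a few lines of Python. This is only a sketch of the same arithmetic. The variable names follow the symbols a, b and S used above.

```python
from statistics import mean

cluster1 = {"A": 80, "B": 82}
cluster2 = {"C": 30, "D": 32}

point = cluster1["A"]

# Step 1: a = average distance to the other points in A's own cluster.
a = mean(abs(point - v) for name, v in cluster1.items() if name != "A")

# Step 2: b = average distance to the points in the other cluster.
b = mean(abs(point - v) for v in cluster2.values())

# Step 3: apply the silhouette formula.
s = (b - a) / max(a, b)

print(a, b, round(s, 2))
```

With more than two clusters, b would be computed for each other cluster separately and the smallest value kept.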
Visual Intuition
Cluster 1
80, 82
Cluster 2
30, 32
Points inside cluster → very close
Clusters themselves → very far apart
So silhouette score becomes close to 1.
Final Summary (Exam Friendly)
Steps to calculate silhouette score:
1. Select a data point.
2. Compute a = average distance to points in the same cluster.
3. Compute b = average distance to the points in the nearest other cluster.
4. Apply formula:
S = (b – a)/max(a,b)
5. Interpret the score.
✔ Short definition for exams:
Silhouette score measures how similar a data point is to its own cluster compared to other
clusters.