
Clustering

The document provides an overview of clustering, specifically focusing on hierarchical and agglomerative clustering methods. It explains the steps involved in agglomerative clustering, the importance of cluster validation, and various validation techniques. Additionally, it discusses the silhouette score as a measure of clustering quality.

Uploaded by

Mohd shami
Copyright
© All Rights Reserved

1. Introduction to Clustering

Clustering is a technique used in data analysis and machine learning to group similar objects
together.

Objects in the same cluster are similar, and objects in different clusters are different.

Simple Real-Life Example

Suppose a teacher wants to group students based on marks.

Student   Marks
A         90
B         88
C         85
D         40
E         45
F         42

Possible clusters:

Cluster 1 → High marks students (A, B, C)

Cluster 2 → Low marks students (D, E, F)

This grouping of similar students is called clustering.

2. What is Hierarchical Clustering?

Hierarchical clustering is a method of clustering where clusters are formed step-by-step in a
hierarchical structure (like a tree).

Instead of directly creating clusters, the algorithm builds a tree of clusters.

This tree structure is called a Dendrogram.

Hierarchy means levels.

Example hierarchy in real life:

Living things classification:

Animal
→ Mammal
→ Dog
→ Labrador
This is hierarchical grouping.

3. Types of Hierarchical Clustering

There are two main types of hierarchical clustering:

1. Agglomerative Hierarchical Clustering

• Start with each data point as its own cluster

• Gradually merge clusters together

This is also called the bottom-up approach.

2. Divisive Hierarchical Clustering

• Start with all data points in one cluster

• Gradually split clusters

This is called the top-down approach.

In this topic we focus on:

Agglomerative Hierarchical Clustering

4. What is Agglomerative Hierarchical Clustering?

Agglomerative Hierarchical Clustering is a clustering technique where:

• Each data point starts as a separate cluster

• The closest clusters are merged step by step

• This continues until all data points form one cluster

The word Agglomerative means:

“Gradually joining or merging things together.”

5. Real-Life Example

Imagine 6 students standing separately in a playground.

Step 1
Each student stands alone.

Step 2
Students with similar marks stand together.

Step 3
Small groups join to form bigger groups.

Step 4
Eventually all students form one large group.

This process is similar to agglomerative clustering.

6. Steps of Agglomerative Hierarchical Clustering

The algorithm works in the following steps.

Step 1: Start with Individual Clusters

Each data point forms its own cluster.

Example data:

Student   Marks
A         10
B         12
C         25
D         27

Initial clusters:

Cluster1 → A
Cluster2 → B
Cluster3 → C
Cluster4 → D

Step 2: Calculate Distance Between Clusters

The algorithm calculates the distance between every pair of clusters.

Distance tells how similar or different clusters are.

Common distance measure:

Euclidean Distance

Example:

Distance between A(10) and B(12) = 2
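
The distance calculation in Step 2 can be sketched in a few lines of Python. This is only an illustration, assuming the four marks from the example table; the names `marks`, `euclidean`, and `pairwise` are made up for this sketch. For one-dimensional data such as marks, the Euclidean distance is simply the absolute difference.

```python
import math
from itertools import combinations

# Marks from the example table above; each point is one-dimensional.
marks = {"A": 10, "B": 12, "C": 25, "D": 27}

def euclidean(p, q):
    """Euclidean distance between two points given as tuples of coordinates."""
    return math.dist(p, q)

# Distance for every unordered pair of students.
pairwise = {
    (s, t): euclidean((marks[s],), (marks[t],))
    for s, t in combinations(marks, 2)
}

for (s, t), d in pairwise.items():
    print(f"d({s}, {t}) = {d:g}")
```

Four points give six pairs, and the two smallest distances (A to B and C to D) are both 2.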


Step 3: Merge the Closest Clusters

The two clusters with the smallest distance are merged.

Example:

Distance between A and B = 2

Distance between C and D = 2

Both pairs are tied at the smallest distance, so either pair may be merged first. Here A and B are merged first.

New clusters:

Cluster1 → A, B
Cluster2 → C
Cluster3 → D

Step 4: Recalculate Distance

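Distances are now recalculated between the new cluster and the remaining clusters.

Example (measuring from the closest members of each cluster):

Cluster1 (A,B) to Cluster2 (C) = |12 - 25| = 13
Cluster1 (A,B) to Cluster3 (D) = |12 - 27| = 15
Cluster2 (C) to Cluster3 (D) = |25 - 27| = 2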

Step 5: Repeat the Process

Again merge the closest clusters.

Eventually we get:

Cluster1 → A, B
Cluster2 → C, D

Then finally:

Cluster → A, B, C, D
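
Steps 1 to 5 can be put together as a short program. The following is a minimal sketch, not an efficient implementation: it assumes one-dimensional marks and measures the distance between two clusters from their closest members. The function names `single_link` and `agglomerate` are chosen for this illustration only.

```python
from itertools import combinations

def single_link(c1, c2, marks):
    """Distance between the closest members of two clusters."""
    return min(abs(marks[p] - marks[q]) for p in c1 for q in c2)

def agglomerate(marks):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(name,) for name in marks]        # Step 1: one cluster per point
    history = []
    while len(clusters) > 1:
        # Steps 2 and 4: compare all current clusters and pick the closest pair.
        # min() keeps the first pair on a tie, so A and B merge before C and D.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: single_link(pair[0], pair[1], marks))
        dist = single_link(c1, c2, marks)
        # Step 3: replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        history.append((c1, c2, dist))            # Step 5: repeat
    return history

marks = {"A": 10, "B": 12, "C": 25, "D": 27}
history = agglomerate(marks)
for c1, c2, dist in history:
    print(f"merge {c1} + {c2} at distance {dist}")
```

For the example data this merges A with B, then C with D, and finally the two pairs, which matches the sequence shown above.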

7. Dendrogram (Tree Diagram)

The clustering process is often shown using a dendrogram.

A dendrogram is a tree-like diagram that shows:

• How clusters merge


• At what distance clusters combine

Example structure:
       ______ A
  ____|
 |    |______ B
_|
 |     ______ C
 |____|
      |______ D

The height of the branches shows the distance between clusters.

By cutting the dendrogram at a certain height, we can decide the number of clusters: every merge above the cut is undone.
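
Cutting a dendrogram can be sketched as follows. The example assumes the merge history is already known as a list of (left members, right members, distance) tuples; the history below is a hypothetical one for marks 10, 12, 25, 27, with cluster distance measured between closest members. The function name `cut_dendrogram` is illustrative only.

```python
def cut_dendrogram(points, merges, height):
    """Return the clusters obtained by cutting the tree at the given height.

    Only merges whose distance is at or below `height` are applied.
    """
    parent = {p: p for p in points}

    def find(p):
        # Follow parent links up to the representative of p's cluster.
        while parent[p] != p:
            p = parent[p]
        return p

    for left, right, dist in merges:
        if dist <= height:
            parent[find(right[0])] = find(left[0])

    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

points = ["A", "B", "C", "D"]
# Hypothetical merge history (assumed for this sketch).
merges = [(("A",), ("B",), 2), (("C",), ("D",), 2), (("A", "B"), ("C", "D"), 13)]

print(cut_dendrogram(points, merges, height=5))   # two clusters
print(cut_dendrogram(points, merges, height=1))   # four single-point clusters
```

A low cut keeps many small clusters, and a high cut keeps a few large ones.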

8. Linkage Methods in Agglomerative Clustering

When merging clusters, we must decide how to measure distance between clusters.

Different methods are called linkage methods.

1. Single Linkage

Distance between closest points of two clusters.

Example:

Cluster1 → {A,B}
Cluster2 → {C,D}

Distance = minimum distance between any two points, one from each cluster.

2. Complete Linkage

Distance between farthest points of two clusters.

Example:

Distance = maximum distance between any two points, one from each cluster.

3. Average Linkage

Distance = average of all pairwise distances between points of the two clusters.

4. Centroid Linkage

Distance between cluster centers (means).
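
The four linkage methods can be compared side by side on one pair of clusters. This sketch assumes the clusters {A,B} and {C,D} hold the marks 10, 12 and 25, 27 used in the earlier steps; the function names are illustrative.

```python
from itertools import product
from statistics import mean

def single(c1, c2):
    """Smallest distance between a point of c1 and a point of c2."""
    return min(abs(p - q) for p, q in product(c1, c2))

def complete(c1, c2):
    """Largest distance between a point of c1 and a point of c2."""
    return max(abs(p - q) for p, q in product(c1, c2))

def average(c1, c2):
    """Mean of all cross-cluster pairwise distances."""
    return mean(abs(p - q) for p, q in product(c1, c2))

def centroid(c1, c2):
    """Distance between the two cluster means."""
    return abs(mean(c1) - mean(c2))

cluster1 = [10, 12]   # A, B (assumed marks)
cluster2 = [25, 27]   # C, D (assumed marks)

for linkage in (single, complete, average, centroid):
    print(linkage.__name__, linkage(cluster1, cluster2))
```

Here single linkage gives 13, complete linkage gives 17, and average and centroid linkage both give 15. On less symmetric data the last two generally differ.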

9. Simple Real World Applications

1. Biology

Scientists group animals or plants based on genetic similarity.

Example hierarchy:

Animal
→ Mammal
→ Dog
→ Labrador

2. Document Organization

Search engines group documents:

Cluster 1 → Sports articles
Cluster 2 → Political articles
Cluster 3 → Technology articles

3. Customer Segmentation

Companies group customers based on:

• Age
• Spending behavior
• Interests

4. Image Processing

Pixels with similar colors are grouped together.

Example:

Cluster 1 → Sky pixels
Cluster 2 → Tree pixels
Cluster 3 → Building pixels

10. Advantages of Agglomerative Clustering

✔ Easy to understand
✔ Does not require number of clusters in advance
✔ Produces hierarchical structure
✔ Useful for small datasets

11. Limitations

✖ Computationally expensive for large datasets (the table of pairwise distances alone grows with the square of the number of points)
✖ Once clusters merge, they cannot be split again
✖ Sensitive to noise and outliers

12. Summary

Concept                   Meaning
Hierarchical Clustering   Clustering with a tree structure
Agglomerative             Bottom-up clustering method
Start                     Each point is its own cluster
Process                   Merge closest clusters step by step
End                       All points form one cluster
Visualization             Dendrogram

Final Simple Definition

Agglomerative Hierarchical Clustering is a bottom-up clustering method in which each data point
starts as an individual cluster and the closest clusters are repeatedly merged until all points belong
to one cluster.

Cluster Validation

1. First Understand: What is Clustering?

Clustering is a technique in data analysis where we group similar data together.

• Objects inside the same cluster are similar.

• Objects in different clusters are different.

Real-World Example

Imagine a school where students are grouped based on marks.


Student   Marks
A         85
B         82
C         30
D         28

Clusters may be:

• Cluster 1: A, B → High marks

• Cluster 2: C, D → Low marks

Here students with similar marks are grouped together.

2. What is Cluster Validation?

After creating clusters, we must check whether the clusters are correct and meaningful.

This checking process is called Cluster Validation.

Simple Definition

Cluster Validation is the process of evaluating how good the clusters are.

It answers questions like:

• Are objects inside clusters really similar?

• Are clusters clearly different from each other?

• Is the clustering meaningful or random?

3. Why Cluster Validation is Important

Sometimes clustering algorithms create wrong or meaningless groups.

Example:

Cluster 1 → A, C
Cluster 2 → B, D

But marks are:

• A = 85

• B = 82

• C = 30
• D = 28

Clearly this grouping is not logical.

Cluster validation helps detect such bad clustering results.

4. Goal of Cluster Validation

A good clustering should have:

1. High similarity within clusters

Objects inside a cluster should be very similar.

Example
Students with similar marks in one group.

2. Low similarity between clusters

Different clusters should be very different.

Example
Top performers and weak students should not be mixed.

5. Types of Cluster Validation

There are three main types:

1. Internal Validation

2. External Validation

3. Relative Validation

6. Internal Validation

Meaning

Internal validation checks cluster quality using only the dataset itself.

It does not require external information.

It measures:

• Distance between points in the same cluster

• Distance between different clusters

Real-Life Example
Suppose a supermarket groups customers based on spending.

Customer   Spending
A          ₹500
B          ₹550
C          ₹5000
D          ₹5200

Clusters formed:

Cluster 1 → A, B
Cluster 2 → C, D

Customers in each cluster have similar spending, so clustering is good.

Internal validation confirms this.

7. External Validation

Meaning

External validation compares clustering results with known correct labels.

It checks whether clusters match real categories.

Real-Life Example

Suppose emails are categorized into:

• Spam

• Not Spam

If clustering groups emails exactly like these labels, the clustering is accurate.

External validation compares the clusters with true labels.
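
One common way to make this comparison concrete is the Rand index, which is not named in these notes and is used here only as an example measure. It is the fraction of point pairs on which the clustering and the true labels agree: either both put the pair together, or both keep it apart. The email labels and the two clusterings below are hypothetical.

```python
from itertools import combinations

def rand_index(predicted, truth):
    """Fraction of point pairs on which the clustering and the labels agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (predicted[i] == predicted[j]) == (truth[i] == truth[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical true labels for five emails.
truth = ["spam", "spam", "not spam", "not spam", "spam"]
good = [0, 0, 1, 1, 0]   # clustering that matches the labels exactly
poor = [0, 1, 0, 1, 0]   # clustering that mixes the two categories

print(rand_index(good, truth))
print(rand_index(poor, truth))
```

A value of 1 means the clusters match the labels exactly. The cluster numbers themselves do not matter, only which emails end up together.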

8. Relative Validation

Meaning

Relative validation compares different clustering results to find the best one.

Example:

We try:
• 2 clusters

• 3 clusters

• 4 clusters

Then validation checks which option gives the best grouping.

Real-Life Example

A company groups customers.

Option 1

• Low spenders

• High spenders

Option 2

• Low spenders

• Medium spenders

• High spenders

Relative validation helps decide which grouping is better.
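
As a rough sketch of relative validation, the two options can be scored with the same measure and the higher score preferred. The example below uses the average silhouette score, which is explained later in these notes. The spending values are hypothetical, and the code assumes one-dimensional data, at least two clusters, and no two identical values.

```python
from statistics import mean

def mean_silhouette(values, labels):
    """Average silhouette score of a one-dimensional clustering."""
    scores = []
    for i, (x, own) in enumerate(zip(values, labels)):
        same = [abs(x - y) for j, (y, lab) in enumerate(zip(values, labels))
                if lab == own and j != i]
        if not same:                  # single-point cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = mean(same)                # average distance inside own cluster
        b = min(                      # average distance to the nearest other cluster
            mean(abs(x - y) for y, lab in zip(values, labels) if lab == other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical customer spending in rupees.
spending = [500, 550, 2500, 2600, 5000, 5200]
option1 = [0, 0, 0, 0, 1, 1]   # low / high
option2 = [0, 0, 1, 1, 2, 2]   # low / medium / high

score1 = mean_silhouette(spending, option1)
score2 = mean_silhouette(spending, option2)
print(f"Option 1: {score1:.2f}, Option 2: {score2:.2f}")
```

For this made-up data Option 2 scores higher, so the three-group option would be preferred. With different data the two-group option could win.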

9. Common Measures Used in Cluster Validation

Some mathematical measures are used to evaluate clusters.

1. Cohesion

Cohesion measures how close data points are inside a cluster.

Lower distance = better cluster.

Example
Students in a cluster have marks:

80, 81, 82

These values are very close → good cohesion.

2. Separation

Separation measures how far clusters are from each other.

Higher distance = better clusters.

Example
Cluster 1 → marks around 80
Cluster 2 → marks around 30

Clusters are clearly different → good separation.
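
Cohesion and separation can be quantified in several ways. The sketch below takes one simple choice as an assumption: cohesion is the average pairwise distance inside a cluster, and separation is the distance between two cluster means. The low-marks cluster values are hypothetical, chosen to be "around 30".

```python
from itertools import combinations
from statistics import mean

def cohesion(cluster):
    """Mean pairwise distance inside one cluster (lower means tighter)."""
    return mean(abs(p - q) for p, q in combinations(cluster, 2))

def separation(c1, c2):
    """Distance between the two cluster means (higher means further apart)."""
    return abs(mean(c1) - mean(c2))

high = [80, 81, 82]   # marks from the cohesion example
low = [29, 30, 31]    # hypothetical marks around 30

print(f"cohesion of high cluster: {cohesion(high):.2f}")
print(f"separation between clusters: {separation(high, low)}")
```

The small cohesion value (about 1.33) and the large separation value (51) together indicate tight, well-separated clusters.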

10. Silhouette Score (Simple Idea)

Silhouette score is a popular method for cluster validation.

It measures:

• similarity within cluster

• difference between clusters

Score range:

Value        Meaning
Close to 1   Very good clustering
Around 0     Overlapping clusters
Negative     Points are likely assigned to the wrong cluster

11. Practical Example (College Admission)

Suppose a college groups students based on entrance scores.

Student   Score
A         92
B         90
C         45
D         40

Clusters:

Cluster 1 → A, B (top students)
Cluster 2 → C, D (low scores)

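Validation checks:

• Students in each cluster are similar (scores differ by 2 in Cluster 1 and by 5 in Cluster 2)

• Clusters are clearly different (the closest scores across clusters, 90 and 45, are 45 apart)

So clustering is valid.

12. Advantages of Cluster Validation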

1. Helps check cluster quality

2. Detects incorrect clustering

3. Helps choose best clustering method

4. Improves accuracy of data analysis

13. Summary

Concept               Meaning
Cluster Validation    Checking the quality of clusters
Internal Validation   Uses only dataset
External Validation   Uses true labels
Relative Validation   Compares different clustering results
Cohesion              Similarity inside cluster
Separation            Difference between clusters

✔ In simple words:
Cluster validation helps us check whether the clusters created by a clustering algorithm are correct,
meaningful, and useful.

Silhouette Score – How to Calculate (Step-by-Step with Simple Example)

Silhouette Score measures how well a data point fits in its cluster.

It checks two things:

1. How close the point is to its own cluster

2. How far the point is from other clusters

The formula for silhouette score of a point is:

S = (b - a) / max(a, b)

Where:

• a = average distance from the point to all other points in the same cluster
• b = average distance from the point to points in the nearest other cluster

Range of silhouette score:

Value        Meaning
Close to 1   Very good clustering
Around 0     Clusters overlap
Negative     Point is likely assigned to the wrong cluster

Step-by-Step Example

Suppose we have 2 clusters of students based on marks.

Cluster 1

A = 80
B = 82

Cluster 2

C = 30
D = 32

We will calculate the silhouette score for student A.

Step 1: Calculate a (distance inside the cluster)

Distance between A and B

|80 - 82| = 2

Since B is the only other point in cluster 1:

a = 2

Step 2: Calculate b (distance to other cluster)

Distances from A to points in Cluster 2

Distance A → C

|80 - 30| = 50

Distance A → D

|80 - 32| = 48

Average distance:

b = (50 + 48) / 2 = 49

Step 3: Apply Silhouette Formula

S = (b - a) / max(a, b)

Substitute values:

S = (49 - 2) / max(2, 49)

S = 47 / 49

S ≈ 0.96

Step 4: Interpret the Result

Silhouette score ≈ 0.96

This means:

• A is very close to its own cluster

• A is very far from the other cluster

So point A is clustered very well. The score for the whole clustering is the average of S over all points.
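
The calculation in Steps 1 to 3 can be checked with a short Python sketch. It assumes the two clusters from this example and that every cluster has at least two points; the function name `silhouette` is chosen for this illustration only.

```python
from statistics import mean

clusters = {
    "Cluster 1": {"A": 80, "B": 82},
    "Cluster 2": {"C": 30, "D": 32},
}

def silhouette(name, clusters):
    """Silhouette score of one named point, following Steps 1 to 3."""
    own = next(c for c in clusters.values() if name in c)
    x = own[name]
    # Step 1: a = average distance to the other points in the same cluster.
    a = mean(abs(x - v) for k, v in own.items() if k != name)
    # Step 2: b = smallest average distance to the points of any other cluster.
    b = min(mean(abs(x - v) for v in c.values())
            for c in clusters.values() if c is not own)
    # Step 3: apply the formula.
    return (b - a) / max(a, b)

s_a = silhouette("A", clusters)
overall = mean(silhouette(n, clusters) for c in clusters.values() for n in c)

print(round(s_a, 2))       # 0.96
print(round(overall, 2))   # 0.96
```

The overall value is the average of the four individual scores, and it is also about 0.96 because all four points sit in tight, well-separated clusters.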


Visual Intuition

Cluster 1
80, 82

Cluster 2
30, 32

Points inside cluster → very close

Clusters themselves → very far apart

So silhouette score becomes close to 1.

Final Summary (Exam Friendly)

Steps to calculate silhouette score:

1. Select a data point.

2. Compute a = average distance to the other points in the same cluster.

3. Compute b = average distance to the points of the nearest other cluster.

4. Apply formula:

S = (b - a) / max(a, b)

5. Interpret the score.

✔ Short definition for exams:

Silhouette score measures how similar a data point is to its own cluster compared to other
clusters.
