Clustering

The document provides an overview of clustering, specifically focusing on hierarchical and agglomerative clustering methods. It explains the steps involved in agglomerative clustering, the importance of cluster validation, and various validation techniques. Additionally, it discusses the silhouette score as a measure of clustering quality.

Uploaded by

Mohd shami

1. Introduction to Clustering

Clustering is a technique used in data analysis and machine learning to group similar objects
together.

Objects in the same cluster are similar, and objects in different clusters are different.

Simple Real-Life Example

Suppose a teacher wants to group students based on marks.

Student   Marks
A         90
B         88
C         85
D         40
E         45
F         42

Possible clusters:

Cluster 1 → High marks students

Cluster 2 → Low marks students

This grouping of similar students is called clustering.

2. What is Hierarchical Clustering?

Hierarchical clustering is a method of clustering where clusters are formed step-by-step in a
hierarchical structure (like a tree).

Instead of directly creating clusters, the algorithm builds a tree of clusters.

This tree structure is called a Dendrogram.

Hierarchy means levels.

Example hierarchy in real life:

Living things classification:

Animal
→ Mammal
→ Dog
→ Labrador

This is hierarchical grouping.

3. Types of Hierarchical Clustering

There are two main types of hierarchical clustering:

1. Agglomerative Hierarchical Clustering

• Start with each data point as its own cluster

• Gradually merge clusters together

This is also called the bottom-up approach.

2. Divisive Hierarchical Clustering

• Start with all data points in one cluster

• Gradually split clusters

This is called the top-down approach.

In this topic we focus on:

Agglomerative Hierarchical Clustering

4. What is Agglomerative Hierarchical Clustering?

Agglomerative Hierarchical Clustering is a clustering technique where:

• Each data point starts as a separate cluster

• The closest clusters are merged step by step
• This continues until all data points form one cluster

The word Agglomerative means:

“Gradually joining or merging things together.”

5. Real-Life Example

Imagine 6 students standing separately in a playground.

Step 1
Each student stands alone.

Step 2
Students with similar marks stand together.

Step 3
Small groups join to form bigger groups.

Step 4
Eventually all students form one large group.

This process is similar to agglomerative clustering.

6. Steps of Agglomerative Hierarchical Clustering

The algorithm works in the following steps.

Step 1: Start with Individual Clusters

Each data point forms its own cluster.

Example data:

Student   Marks
A         10
B         12
C         25
D         27

Initial clusters:

Cluster1 → A
Cluster2 → B
Cluster3 → C
Cluster4 → D

Step 2: Calculate Distance Between Clusters

The algorithm calculates the distance between every pair of clusters.

Distance tells how similar or different two clusters are: a small distance means the clusters are similar.

Common distance measure:

Euclidean Distance

Example:

Distance between A(10) and B(12) = 2
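The distance step can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name pairwise_distances is hypothetical, and because each student has a single mark, Euclidean distance reduces to the absolute difference.

```python
from itertools import combinations

# Marks from the example table above.
marks = {"A": 10, "B": 12, "C": 25, "D": 27}

def pairwise_distances(points):
    """Return {(name1, name2): distance} for every unordered pair of points.

    With one value per point, Euclidean distance is the absolute difference.
    """
    return {
        (p, q): abs(points[p] - points[q])
        for p, q in combinations(sorted(points), 2)
    }

distances = pairwise_distances(marks)
for pair, d in distances.items():
    print(pair, d)
```

Four points give six pairs. The smallest distances are A-B and C-D, both equal to 2.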

Step 3: Merge the Closest Clusters

The two clusters with the smallest distance between them are merged.

Example:

Distance between A and B = 2

Distance between C and D = 2

Both pairs are tied at the smallest distance, so either pair may be merged first. Here A and B are merged.

New clusters:

Cluster1 → A, B
Cluster2 → C
Cluster3 → D

Step 4: Recalculate Distance

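Distances are now recalculated between the new clusters. A merged cluster contains more than one point, so a rule is needed for measuring its distance to other clusters (see the linkage methods in Section 8).

Example, using the distance between the closest points:

Cluster1 (A,B) to Cluster2 (C) = |12 - 25| = 13
Cluster1 (A,B) to Cluster3 (D) = |12 - 27| = 15
Cluster2 (C) to Cluster3 (D) = |25 - 27| = 2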

Step 5: Repeat the Process

Again merge the closest clusters.

Eventually we get:

Cluster1 → A, B
Cluster2 → C, D

Then finally:

Cluster → A, B, C, D
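Steps 1 to 5 can be put together in a short Python sketch. It is only an illustration under stated assumptions: the notes do not say which rule measures the distance to a merged cluster, so the sketch assumes single linkage (distance between the closest points), and the names single_linkage and agglomerate are hypothetical.

```python
def single_linkage(c1, c2, points):
    """Smallest distance between any point of c1 and any point of c2 (assumed rule)."""
    return min(abs(points[p] - points[q]) for p in c1 for q in c2)

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    # Step 1: every point starts as its own cluster.
    clusters = [(name,) for name in sorted(points)]
    history = []
    while len(clusters) > 1:
        # Steps 2 and 4: compute distances and find the closest pair.
        # On a tie, the first pair found is kept.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge the closest pair and record the merge distance.
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

marks = {"A": 10, "B": 12, "C": 25, "D": 27}
history = agglomerate(marks)
for left, right, d in history:
    print(left, "+", right, "at distance", d)
```

For the example data, A and B merge at distance 2, then C and D at distance 2, and finally the two pairs merge at distance 13.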

7. Dendrogram (Tree Diagram)

The clustering process is often shown using a dendrogram.

A dendrogram is a tree-like diagram that shows:

• How clusters merge

• At what distance clusters combine

Example structure:
        ______ A
   ____|
  |    |______ B
__|
  |     ______ C
  |____|
       |______ D

The height of the branches shows the distance between clusters.

By cutting the dendrogram at a chosen height, we decide the number of clusters: every branch that the cut crosses becomes one cluster.
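Cutting the dendrogram at a height is equivalent to stopping the merging once the closest pair of clusters is farther apart than that height. The sketch below illustrates this idea on the marks from Section 6. It assumes single linkage, and the function name cut_clusters and the parameter threshold are hypothetical.

```python
def cut_clusters(points, threshold):
    """Clusters obtained by refusing any merge whose distance exceeds `threshold`.

    Assumes single linkage: cluster distance = distance between closest points.
    """
    clusters = [[name] for name in sorted(points)]
    while len(clusters) > 1:
        # Closest pair of clusters as a (distance, i, j) tuple with i < j.
        d, i, j = min(
            (min(abs(points[p] - points[q]) for p in clusters[i] for q in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # the next merge would lie above the cut line
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

marks = {"A": 10, "B": 12, "C": 25, "D": 27}
print(cut_clusters(marks, 1))   # cut below every merge: four clusters
print(cut_clusters(marks, 5))   # cut between heights 2 and 13: two clusters
print(cut_clusters(marks, 20))  # cut above every merge: one cluster
```

A low cut keeps many small clusters, and a high cut keeps a few large ones.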

8. Linkage Methods in Agglomerative Clustering

When merging clusters, we must decide how to measure distance between clusters.

Different methods are called linkage methods.

1. Single Linkage

Distance between closest points of two clusters.

Example:

Cluster1 → {A,B}
Cluster2 → {C,D}

Distance = minimum distance between any two points.

2. Complete Linkage

Distance between farthest points of two clusters.

Example:

Distance = maximum distance between points.

3. Average Linkage

Distance = average of all pairwise distances between clusters.

4. Centroid Linkage
Distance between cluster centers (means).
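The four linkage methods can be compared on one pair of clusters. The sketch below is illustrative only: it assumes the clusters {A,B} and {C,D} hold the marks from Section 6 (10, 12 and 25, 27), and the function names are hypothetical.

```python
from statistics import mean

def pair_distances(c1, c2):
    """All point-to-point distances between two clusters of single values."""
    return [abs(x - y) for x in c1 for y in c2]

def single_linkage(c1, c2):
    return min(pair_distances(c1, c2))   # closest points

def complete_linkage(c1, c2):
    return max(pair_distances(c1, c2))   # farthest points

def average_linkage(c1, c2):
    return mean(pair_distances(c1, c2))  # average over all pairs

def centroid_linkage(c1, c2):
    return abs(mean(c1) - mean(c2))      # distance between cluster means

cluster1 = [10, 12]  # A, B (assumed values from Section 6)
cluster2 = [25, 27]  # C, D (assumed values from Section 6)

print("single  ", single_linkage(cluster1, cluster2))
print("complete", complete_linkage(cluster1, cluster2))
print("average ", average_linkage(cluster1, cluster2))
print("centroid", centroid_linkage(cluster1, cluster2))
```

For these values the single linkage distance is 13, the complete linkage distance is 17, and the average and centroid distances are both 15. The two need not agree in general.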

9. Simple Real World Applications

1. Biology

Scientists group animals or plants based on genetic similarity.

Example hierarchy:

Animal
→ Mammal
→ Dog
→ Labrador

2. Document Organization

Search engines group documents:

Cluster 1 → Sports articles

Cluster 2 → Political articles
Cluster 3 → Technology articles

3. Customer Segmentation

Companies group customers based on:

• Age
• Spending behavior
• Interests

4. Image Processing

Pixels with similar colors are grouped together.

Example:

Cluster 1 → Sky pixels

Cluster 2 → Tree pixels
Cluster 3 → Building pixels

10. Advantages of Agglomerative Clustering

✔ Easy to understand
✔ Does not require the number of clusters to be chosen in advance
✔ Produces hierarchical structure
✔ Useful for small datasets

11. Limitations

✖ Computationally expensive for large datasets, because distances between all pairs of points are needed

✖ Once clusters merge, they cannot be split again
✖ Sensitive to noise and outliers

12. Summary

Concept                   Meaning
Hierarchical Clustering   Clustering with a tree structure
Agglomerative             Bottom-up clustering method
Start                     Each point is its own cluster
Process                   Merge closest clusters step by step
End                       All points form one cluster
Visualization             Dendrogram

Final Simple Definition

Agglomerative Hierarchical Clustering is a bottom-up clustering method in which each data point
starts as an individual cluster and the closest clusters are repeatedly merged until all points belong
to one cluster.

Cluster Validation

1. First Understand: What is Clustering?

Clustering is a technique in data analysis where we group similar data together.

• Objects inside the same cluster are similar.

• Objects in different clusters are different.

Real-World Example

Imagine a school where students are grouped based on marks.

Student   Marks
A         85
B         82
C         30
D         28

Clusters may be:

• Cluster 1: A, B → High marks

• Cluster 2: C, D → Low marks

Here students with similar marks are grouped together.

2. What is Cluster Validation?

After creating clusters, we must check whether the clusters are correct and meaningful.

This checking process is called Cluster Validation.

Simple Definition

Cluster Validation is the process of evaluating how good the clusters are.

It answers questions like:

• Are objects inside clusters really similar?

• Are clusters clearly different from each other?

• Is the clustering meaningful or random?

3. Why Cluster Validation is Important

Sometimes clustering algorithms create wrong or meaningless groups.

Example:

Cluster 1 → A, C
Cluster 2 → B, D

But marks are:

• A = 85

• B = 82

• C = 30
• D = 28

Clearly this grouping is not logical.

Cluster validation helps detect such bad clustering results.

4. Goal of Cluster Validation

A good clustering should have:

1. High similarity within clusters

Objects inside a cluster should be very similar.

Example
Students with similar marks in one group.

2. Low similarity between clusters

Different clusters should be very different.

Example
Top performers and weak students should not be mixed.

5. Types of Cluster Validation

There are three main types:

1. Internal Validation

2. External Validation

3. Relative Validation

6. Internal Validation

Meaning

Internal validation checks cluster quality using only the dataset itself.

It does not require external information.

It measures:

• Distance between points in the same cluster

• Distance between different clusters

Real-Life Example
Suppose a supermarket groups customers based on spending.

Customer   Spending
A          ₹500
B          ₹550
C          ₹5000
D          ₹5200

Clusters formed:

Cluster 1 → A, B
Cluster 2 → C, D

Customers in each cluster have similar spending, so clustering is good.

Internal validation confirms this.

7. External Validation

Meaning

External validation compares clustering results with known correct labels.

It checks whether clusters match real categories.

Real-Life Example

Suppose emails are categorized into:

• Spam

• Not Spam

If clustering groups emails exactly like these labels, the clustering is accurate.

External validation compares the clusters with true labels.
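One common way to compare clusters with true labels is the Rand index: the fraction of pairs of objects on which the clustering and the labels agree (both put the pair together, or both keep it apart). The notes do not name a specific measure, so the Rand index is an assumption here, and the four emails below are hypothetical data.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Fraction of object pairs on which the clustering and the true labels agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_pred[i] == labels_pred[j]) == (labels_true[i] == labels_true[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical true labels for four emails.
true_labels = ["spam", "spam", "not spam", "not spam"]

perfect = [0, 0, 1, 1]  # clusters match the labels exactly
mixed = [0, 1, 0, 1]    # each cluster mixes spam and non-spam

print(rand_index(perfect, true_labels))
print(rand_index(mixed, true_labels))
```

The cluster numbers themselves do not matter, only which emails are grouped together. A value of 1 means the clusters match the labels exactly.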

8. Relative Validation

Meaning

Relative validation compares different clustering results to find the best one.

Example:

We try:
• 2 clusters

• 3 clusters

• 4 clusters

Then validation checks which option gives the best grouping.

Real-Life Example

A company groups customers.

Option 1

• Low spenders

• High spenders

Option 2

• Low spenders

• Medium spenders

• High spenders

Relative validation helps decide which grouping is better.
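Relative validation can be sketched by scoring each candidate grouping with the same measure and keeping the higher score. The example below uses the average silhouette score (explained in Section 10) as that measure, which is one possible choice. It reuses the supermarket spending values from Section 6 as assumed data, and the function name mean_silhouette is hypothetical. A cluster with a single point is given a score of 0, the usual convention.

```python
from statistics import mean

def mean_silhouette(clusters):
    """Average silhouette over all points; points in singleton clusters score 0."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, x in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: average distance to the other points of the same cluster.
            a = mean([abs(x - y) for pj, y in enumerate(cluster) if pj != pi])
            # b: average distance to the points of the nearest other cluster.
            b = min(
                mean([abs(x - y) for y in other])
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return mean(scores)

# Spending values from Section 6 (assumed to apply here).
two_clusters = [[500, 550], [5000, 5200]]
three_clusters = [[500, 550], [5000], [5200]]

print(round(mean_silhouette(two_clusters), 2))
print(round(mean_silhouette(three_clusters), 2))
```

For these values the two-cluster option scores about 0.97 and the three-cluster option about 0.49, so the two-cluster grouping would be preferred.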

9. Common Measures Used in Cluster Validation

Some mathematical measures are used to evaluate clusters.

1. Cohesion

Cohesion measures how close data points are inside a cluster.

Lower distance = better cluster.

Example
Students in a cluster have marks:

80, 81, 82

These values are very close → good cohesion.

2. Separation

Separation measures how far clusters are from each other.

Higher distance = better clusters.

Example
Cluster 1 → marks around 80
Cluster 2 → marks around 30

Clusters are clearly different → good separation.
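Cohesion and separation can be computed directly for small examples. The definitions below are one simple choice and an assumption, since the notes give no formulas: cohesion is the average distance between points of a cluster, and separation is the distance between the two cluster means. The data are the marks from Section 3 (A = 85, B = 82, C = 30, D = 28).

```python
from itertools import combinations
from statistics import mean

def cohesion(cluster):
    """Average pairwise distance inside one cluster (lower means tighter)."""
    return mean([abs(x - y) for x, y in combinations(cluster, 2)])

def separation(c1, c2):
    """Distance between the two cluster means (higher means better separated)."""
    return abs(mean(c1) - mean(c2))

good = ([85, 82], [30, 28])  # {A, B} and {C, D}
bad = ([85, 30], [82, 28])   # {A, C} and {B, D}

for name, (c1, c2) in (("good", good), ("bad", bad)):
    print(name, cohesion(c1), cohesion(c2), separation(c1, c2))
```

The logical grouping has small within-cluster distances (3 and 2) and a large separation (54.5). The illogical grouping has large within-cluster distances (55 and 54) and almost no separation (2.5).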

10. Silhouette Score (Simple Idea)

Silhouette score is a popular method for cluster validation.

It measures:

• similarity within cluster

• difference between clusters

Score range (from -1 to +1):

Value        Meaning
Close to 1   Very good clustering
Around 0     Overlapping clusters
Negative     Points are likely assigned to the wrong cluster

11. Practical Example (College Admission)

Suppose a college groups students based on entrance scores.

Student   Score
A         92
B         90
C         45
D         40

Clusters:

Cluster 1 → A, B (top students)

Cluster 2 → C, D (low scores)

Validation checks:

• Students in each cluster are similar

• Clusters are clearly different

So clustering is valid.

12. Advantages of Cluster Validation

1. Helps check cluster quality

2. Detects incorrect clustering

3. Helps choose best clustering method

4. Improves accuracy of data analysis

13. Summary

Concept               Meaning
Cluster Validation    Checking the quality of clusters
Internal Validation   Uses only dataset
External Validation   Uses true labels
Relative Validation   Compares different clustering results
Cohesion              Similarity inside cluster
Separation            Difference between clusters

✔ In simple words:
Cluster validation helps us check whether the clusters created by a clustering algorithm are correct,
meaningful, and useful.

Silhouette Score – How to Calculate (Step-by-Step with Simple Example)

Silhouette Score measures how well a data point fits in its cluster.

It checks two things:

1. How close the point is to its own cluster

2. How far the point is from other clusters

The formula for silhouette score of a point is:

S = (b - a) / max(a, b)

Where:

• a = average distance from the point to all other points in the same cluster
• b = average distance from the point to points in the nearest other cluster

Range of silhouette score (from -1 to +1):

Value        Meaning
Close to 1   The point fits its cluster very well
Around 0     The point lies between two overlapping clusters
Negative     The point is likely assigned to the wrong cluster

Step-by-Step Example

Suppose we have 2 clusters of students based on marks.

Cluster 1

A = 80
B = 82

Cluster 2

C = 30
D = 32

We will calculate the silhouette score for student A.

Step 1: Calculate a (distance inside the cluster)

Distance between A and B

|80 - 82| = 2

Since B is the only other point in cluster 1:

a = 2

Step 2: Calculate b (distance to other cluster)

Distances from A to points in Cluster 2

Distance A → C

|80 - 30| = 50

Distance A → D

|80 - 32| = 48

Average distance:

b = (50 + 48) / 2 = 49

Step 3: Apply Silhouette Formula

S = (b - a) / max(a, b)

Substitute values (here max(a, b) = max(2, 49) = 49):

S = (49 - 2) / 49

S = 47 / 49

S ≈ 0.96

Step 4: Interpret the Result

Silhouette score ≈ 0.96

This means:

• A is very close to its own cluster

• A is very far from the other cluster

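So A fits its cluster very well. Repeating the calculation for every point and averaging the results gives the silhouette score of the whole clustering.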
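The calculation for student A can be checked with a few lines of Python. This is a minimal sketch of the formula above. The function name silhouette_point and its parameters are hypothetical.

```python
from statistics import mean

def silhouette_point(x, same_cluster_others, other_clusters):
    """Return (a, b, S) for one point with value x.

    same_cluster_others: the other points in x's own cluster (must not be empty).
    other_clusters: a list of the remaining clusters.
    """
    # a: average distance to the other points of the same cluster.
    a = mean([abs(x - y) for y in same_cluster_others])
    # b: average distance to the points of the nearest other cluster.
    b = min(mean([abs(x - y) for y in cluster]) for cluster in other_clusters)
    return a, b, (b - a) / max(a, b)

# Student A = 80, with B = 82 in the same cluster and C = 30, D = 32 in the other.
a, b, s = silhouette_point(80, [82], [[30, 32]])
print(a, b, round(s, 2))  # 2 49 0.96
```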

Visual Intuition

Cluster 1
80, 82

Cluster 2
30, 32

Points inside cluster → very close

Clusters themselves → very far apart

So silhouette score becomes close to 1.

Final Summary (Exam Friendly)

Steps to calculate silhouette score:

1. Select a data point.

2. Compute a = average distance to points in the same cluster.

3. Compute b = average distance to the points in the nearest other cluster.

4. Apply formula:

S = (b - a) / max(a, b)

5. Interpret the score.

✔ Short definition for exams:

Silhouette score measures how similar a data point is to its own cluster compared to other
clusters.
