0% found this document useful (0 votes)

5 views5 pages

Unit 7

unit 7

Uploaded by

nv1242326

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views5 pages

Unit 7

unit 7

Uploaded by

nv1242326

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Unsupervised Learning for Clustering

1. Basics of Clustering

Introduction and Philosophy

Clustering is the task of grouping a set of objects in such a way that objects in the
same group (called a cluster) are more similar to each other than to those in other
groups. It is a form of unsupervised learning because the algorithm is not provided
with ground-truth labels; it must find the structure on its own.

● From "Machine Learning" (Tom Mitchell): "In unsupervised learning, the learner
is given a set of examples without any labels... The goal might be to create a
classification for the objects."
● From "Pattern Recognition" (Rajjan Singhal): Clustering is an "exploratory data
analysis" technique to uncover the "natural grouping" in a set of patterns, points,
or objects.

Key Concepts

● Goal: To partition data into meaningful subgroups (clusters).

● No Labels: Unlike supervised learning, there are no target outputs or class labels
provided during training.
● Similarity/Dissimilarity: The entire process hinges on a measure of similarity.
Data points that are "close" according to a chosen distance metric (like Euclidean
distance) are grouped together.

Common Distance Metrics (from "Pattern Recognition" by Singhal):

● Euclidean Distance: The straight-line distance between two points. Most

common for continuous features.
● Manhattan Distance: The sum of absolute differences along each dimension.
● Cosine Similarity: Measures the cosine of the angle between two vectors.
Excellent for text data or high-dimensional data where magnitude is less
important than orientation.

Real-World Applications for Postgraduates:

● Customer Segmentation: Grouping customers based on purchasing behavior for
targeted marketing.
● Image Segmentation: Partitioning an image into regions of similar pixels for
computer vision.
● Document Clustering: Grouping news articles or research papers by topic without
knowing the categories in advance.
● Genetics: Clustering gene expression data to discover groups of genes with
similar functions.

2. K-Means Clustering

This is one of the most popular and simple clustering algorithms. It is an iterative,
centroid-based algorithm.

● From "Artificial Intelligence: A Modern Approach" (Russell & Norvig): K-means is

a method for "vector quantization" that aims to partition n observations into k
clusters.

The Algorithm Step-by-Step

The goal is to partition the data into K clusters, where each data point belongs to the
cluster with the nearest mean (cluster center or centroid).

Input:

● K: The number of clusters (a hyperparameter you must choose).

● Dataset: {x¹, x², ..., xⁿ} where each x is a d-dimensional data point.

Output:

● A set of K cluster centroids.

● A label for each data point assigning it to one of the K clusters.

The Procedure (as detailed in "Machine Learning" by Tom Mitchell and "Artificial
Intelligence and Machine Learning" by Vinod Chandra & Anand):

1. Initialization:
○ Randomly select K data points from the dataset to serve as the initial
cluster centroids, {μ₁, μ₂, ..., μ }.
2. The Iteration Loop (Repeat until convergence):
○ Assignment Step (Expectation):
■ For each data point xⁱ in the dataset:
■ Calculate the distance between xⁱ and each centroid μⱼ
(typically using Euclidean distance).
■ Assign xⁱ to the cluster whose centroid is the closest.
■ Formally, assign each point to the cluster j where j = argminⱼ
||xⁱ - μⱼ||².
○ Update Step (Maximization):
■ For each cluster j, recalculate its centroid μⱼ as the mean
(average) of all data points currently assigned to that cluster.
■ Formally, μⱼ = (1 / |Cⱼ|) * Σ xⁱ for all xⁱ in cluster Cⱼ.
3. Convergence:
○ The algorithm is considered to have converged when the centroids no
longer change significantly between iterations, or when the cluster
assignments stabilize.

A Detailed Example: Clustering University Departments by Research Funding

Imagine you are a university administrator and want to group departments based on
two metrics: Annual Research Funding (in millions) and Number of PhD
Graduates per year. You decide to use K-means with K=3 to find low, medium, and
high-performing clusters.

Your Data (simplified):

● Dept A: (1.0, 2)
● Dept B: (1.5, 3)
● Dept C: (3.0, 5)
● Dept D: (5.0, 7)
● Dept E: (3.5, 5)
● Dept F: (4.5, 6)
● Dept G: (3.5, 4)

Step 1: Initialization

● Let's randomly choose K=3 initial centroids: μ₁ = A(1.0, 2), μ₂ = C(3.0, 5),
μ₃ = F(4.5, 6).

Step 2: First Assignment Step

● Calculate distances of each point to all three centroids and assign it to the
closest one.
○ Point B(1.5, 3):
■ Dist to μ₁: √((1.5-1.0)² + (3-2)²) = √(0.25+1) ≈ 1.12
■ Dist to μ₂: √((1.5-3.0)² + (3-5)²) = √(2.25+4) ≈ 2.5
■ Dist to μ₃: √((1.5-4.5)² + (3-6)²) = √(9+9) ≈ 4.24
■ Assignment: Cluster 1 (closest to μ₁).
● Repeat for all points. The initial clusters might be:
○ Cluster 1 (μ₁): A, B
○ Cluster 2 (μ₂): C, E, G
○ Cluster 3 (μ₃): D, F

Step 3: First Update Step

● Recalculate centroids as the mean of the points in each cluster.

○ New μ₁: Mean of A and B = ((1.0+1.5)/2, (2+3)/2) = (1.25, 2.5)
○ New μ₂: Mean of C, E, G = ((3.0+3.5+3.5)/3, (5+5+4)/3) = (3.33,
4.67)
○ New μ₃: Mean of D and F = ((5.0+4.5)/2, (7+6)/2) = (4.75, 6.5)

Iteration 2:

● Assignment Step: Reassign all points using the new centroids (1.25, 2.5),
(3.33, 4.67), (4.75, 6.5).
● You will likely find that point G(3.5, 4) is now closer to the new μ₂ than to μ₃.
● New Clusters: Cluster 1: {A, B}; Cluster 2: {C, E, G}; Cluster 3: {D, F}
● Update Step: Recalculate centroids again. They will shift slightly.

The algorithm continues until the centroids and cluster assignments stop changing.
The final result might be:

● Cluster 1 (Low): A, B
● Cluster 2 (Medium): C, E, G
● Cluster 3 (High): D, F

Critical Considerations for Postgraduates

● Choosing K (The Elbow Method): A major challenge. The algorithm requires K to

be specified in advance. A common heuristic is the Elbow Method (from
"Machine Learning" by Anuradha & Vincy): Plot the sum of squared distances
from points to their assigned centroids against different values of K. The "elbow"
of the curve, where the rate of decrease sharply changes, is often a good choice
for K.
● Sensitivity to Initialization: The random initial centroid selection can lead to
sub-optimal results. Common solutions include running the algorithm multiple
times (K-means++ is a smarter initialization method discussed in advanced
texts).
● Assumes Spherical Clusters: K-means works best when clusters are spherical
and roughly equal in size. It performs poorly with complex, non-globular
shapes.

Common questions

The iterative process in the K-means algorithm refines clusters over time through repeated execution of the Assignment and Update steps until convergence. During the Assignment step, each data point is assigned to the cluster with the closest centroid. In the Update step, centroids are recalculated as the mean of all points in the cluster. By iterating these steps, the algorithm ensures that data points are grouped into clusters of increasing internal similarity, refining the clusters until the centroids and assignments no longer change significantly .

K-means clustering differs from other forms of unsupervised learning primarily in its objective of partitioning data into a specified number of clusters based on similarity, represented by the distance to cluster centroids. Unlike other methods that may not require pre-defined group numbers or might employ different principles, such as dimensionality reduction techniques, K-means focuses on creating clearly separated groups with centroid-based optimization, producing clusters with defined center points and direct assignments of data points to clusters based on proximity .

Postgraduates face several challenges when applying K-means clustering, such as selecting the number of clusters (K), dealing with sensitivity to initial centroids, and assuming spherical cluster shapes, which can lead to poor results with non-globular data. They can address these challenges by using methods like the Elbow Method for determining K, employing multiple runs and smart initialization (e.g., K-means++) to mitigate sensitivity to initial centroids, and applying alternative clustering algorithms that accommodate diverse data shapes and distributions .

Clustering plays a crucial role in exploring natural groupings of data points, facilitating insights in various real-world applications. For instance, in customer segmentation, it groups customers based on purchasing behavior, aiding targeted marketing strategies. In image segmentation, it divides an image into regions of similar pixels, benefiting computer vision tasks. Document clustering organizes articles by topic without predefined categories, supporting efficient information retrieval. In genetics, it groups gene expression data to discover genes with similar functions, advancing biological research .

K-means clustering is sensitive to the initial selection of centroids because a poor choice can lead to suboptimal clustering results, as the algorithm might converge to local minima rather than the global one. This issue can be mitigated by running the algorithm multiple times with different initializations and by using smarter initialization methods such as K-means++, which selects initial centroids in a way that is likely to result in better performance .

Document clustering is particularly effective in scenarios involving large, unstructured datasets where category labels are not predefined, such as organizing news articles, research papers, or social media content by underlying topics. It helps in summarizing information, detecting thematic structures, and facilitating content recommendations or search enhancements. Its effectiveness is heightened where the aim is to explore high-dimensional text data to discover hidden patterns independently of human categorization .

The main goal of clustering in unsupervised learning is to group a set of objects in such a way that objects in the same group, or cluster, are more similar to each other than to those in other groups. It involves partitioning data into meaningful subgroups without any labels, focusing on similarity or dissimilarity measures among data points .

Distance metrics are used in clustering algorithms to quantify the similarity or dissimilarity between data points, which is crucial for grouping similar objects into clusters. Common metrics include Euclidean distance, suitable for continuous features; Manhattan distance, which is the sum of absolute differences along each dimension; and Cosine similarity, ideal for text data or high-dimensional data where the magnitude is less important than orientation .

The assumption of spherical clusters can significantly affect the performance of K-means clustering because it is best suited for datasets where the clusters are spherical and of roughly equal size. This assumption limits K-means in handling clusters of arbitrary shapes or varying sizes, often resulting in poor performance when clusters are non-globular or when the data contains outliers .

The choice of the number of clusters (K) in K-means clustering significantly impacts the results as it determines how the data is grouped. Selecting an inappropriate K can lead to overfitting or underfitting of the data. The Elbow Method is a heuristic used to determine an optimal K, by plotting the sum of squared distances from points to their assigned centroids against different K values. The 'elbow' point, where the rate of decrease sharply changes, is often considered a good choice for K .

Understanding k-Means Clustering Techniques
No ratings yet
Understanding k-Means Clustering Techniques
68 pages
K-Means Clustering Algorithm Explained
No ratings yet
K-Means Clustering Algorithm Explained
14 pages
Clustering Algorithms Overview and Methods
No ratings yet
Clustering Algorithms Overview and Methods
19 pages
Unsupervised Learning: Clustering Methods
No ratings yet
Unsupervised Learning: Clustering Methods
51 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
45 pages
Unsupervised Machine Learning Techniques
No ratings yet
Unsupervised Machine Learning Techniques
29 pages
8 Clustering 2026
No ratings yet
8 Clustering 2026
166 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
32 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
32 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
11 pages
Unsupervised Learning: Clustering Techniques
No ratings yet
Unsupervised Learning: Clustering Techniques
32 pages
Pattern Lec 8
No ratings yet
Pattern Lec 8
21 pages
K-Means Clustering in Unsupervised Learning
No ratings yet
K-Means Clustering in Unsupervised Learning
9 pages
Introduction To K-Means Clustering
No ratings yet
Introduction To K-Means Clustering
28 pages
K-Means Clustering Overview
No ratings yet
K-Means Clustering Overview
16 pages
K-Medoids Clustering Overview
No ratings yet
K-Medoids Clustering Overview
50 pages
Clustering Techniques in Neural Networks
No ratings yet
Clustering Techniques in Neural Networks
168 pages
K-Means Clustering Overview
No ratings yet
K-Means Clustering Overview
19 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
53 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
36 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
76 pages
16 KMeans
No ratings yet
16 KMeans
22 pages
Unit 4
No ratings yet
Unit 4
124 pages
Unsupervised Learning: K-Means Clustering
No ratings yet
Unsupervised Learning: K-Means Clustering
96 pages
Chapter 4
No ratings yet
Chapter 4
96 pages
Understanding Clustering in Machine Learning
No ratings yet
Understanding Clustering in Machine Learning
10 pages
Unsupervised Learning: K-Means & Hierarchical Clustering
No ratings yet
Unsupervised Learning: K-Means & Hierarchical Clustering
17 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
9 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
5 pages
Introduction to Clustering in ML
No ratings yet
Introduction to Clustering in ML
11 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
12 pages
Lecture
No ratings yet
Lecture
23 pages
K Mean Clustering
No ratings yet
K Mean Clustering
31 pages
Unsupervised Learning: Clustering Techniques
No ratings yet
Unsupervised Learning: Clustering Techniques
31 pages
W2 Clustering
No ratings yet
W2 Clustering
4 pages
Understanding Clustering in Unsupervised Learning
No ratings yet
Understanding Clustering in Unsupervised Learning
43 pages
Understanding Clustering Techniques
No ratings yet
Understanding Clustering Techniques
23 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
16 pages
K-Means Clustering in Unsupervised Learning
No ratings yet
K-Means Clustering in Unsupervised Learning
31 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
25 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
48 pages
Clustering Algorithms and Distance Measures
No ratings yet
Clustering Algorithms and Distance Measures
21 pages
Understanding Clustering in Unsupervised Learning
No ratings yet
Understanding Clustering in Unsupervised Learning
63 pages
Unsupervised Learning: Clustering Methods
No ratings yet
Unsupervised Learning: Clustering Methods
23 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
36 pages
Understanding the EM Algorithm in ML
No ratings yet
Understanding the EM Algorithm in ML
18 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
16 pages
Unsupervised Learning: Clustering & K-Means
No ratings yet
Unsupervised Learning: Clustering & K-Means
8 pages
Understanding Clustering Algorithms
No ratings yet
Understanding Clustering Algorithms
23 pages
Unit4-Clustering and KNN
No ratings yet
Unit4-Clustering and KNN
12 pages
ML Unsupervised Clustering 13
No ratings yet
ML Unsupervised Clustering 13
15 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
18 pages
K-Means Clustering Algorithm Overview
No ratings yet
K-Means Clustering Algorithm Overview
47 pages
Unsupervised Learning: K-Means Clustering
No ratings yet
Unsupervised Learning: K-Means Clustering
18 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
31 pages
Understanding Clustering in Machine Learning
No ratings yet
Understanding Clustering in Machine Learning
20 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
51 pages
Unsupervised Learning: Clustering & K-Means
No ratings yet
Unsupervised Learning: Clustering & K-Means
10 pages
Community Work Programme Job Openings
No ratings yet
Community Work Programme Job Openings
6 pages
DFMA Course Overview and Guidelines
No ratings yet
DFMA Course Overview and Guidelines
32 pages
Deleting Purchase Orders in SAP Guide
No ratings yet
Deleting Purchase Orders in SAP Guide
3 pages
Basics of CNC Machines Overview
No ratings yet
Basics of CNC Machines Overview
21 pages
Reducing Downtime with Managed Services
No ratings yet
Reducing Downtime with Managed Services
5 pages
CTS Connectionj Check Issue From Hana HXD Connection
No ratings yet
CTS Connectionj Check Issue From Hana HXD Connection
13 pages
Excel Basics for Beginners Guide
No ratings yet
Excel Basics for Beginners Guide
10 pages
Microsoft Visual C++ Overview
No ratings yet
Microsoft Visual C++ Overview
25 pages
Maricopa County Audit Data Deletion Claims
100% (1)
Maricopa County Audit Data Deletion Claims
14 pages
Marriage Registration Details 2023
No ratings yet
Marriage Registration Details 2023
32 pages
Exam Ref Az500 Microsoft Azure Security Technologies 3rd Edition Yuri Diogenes Orin Thomas Download Link
100% (4)
Exam Ref Az500 Microsoft Azure Security Technologies 3rd Edition Yuri Diogenes Orin Thomas Download Link
161 pages
BMTI Dhaka Class Schedule Overview
No ratings yet
BMTI Dhaka Class Schedule Overview
3 pages
Ultimate Career Roadmap for Tech Roles
No ratings yet
Ultimate Career Roadmap for Tech Roles
6 pages
BCB601: AI for Business Exam Paper
No ratings yet
BCB601: AI for Business Exam Paper
4 pages
School of Computer Sciences and Engineering-1
No ratings yet
School of Computer Sciences and Engineering-1
27 pages
Dell Commercial Docking Help Me Choose
No ratings yet
Dell Commercial Docking Help Me Choose
4 pages
Guitar Effects Settings Overview
No ratings yet
Guitar Effects Settings Overview
5 pages
DepEd Project Development Officer II Roles
100% (2)
DepEd Project Development Officer II Roles
2 pages
MSc in Geoinformatics at Trier University
No ratings yet
MSc in Geoinformatics at Trier University
15 pages
Oop PBL
No ratings yet
Oop PBL
16 pages
Cyber Essentials Assessment Guide 2023
No ratings yet
Cyber Essentials Assessment Guide 2023
66 pages
Linear-Time Sorting Algorithms Explained
No ratings yet
Linear-Time Sorting Algorithms Explained
24 pages
Wireshark DNS Query Analysis Lab
No ratings yet
Wireshark DNS Query Analysis Lab
13 pages
AI Unit 5
No ratings yet
AI Unit 5
14 pages
Application of The Octalysis Framework To Gamification Designs For The Elderly
No ratings yet
Application of The Octalysis Framework To Gamification Designs For The Elderly
10 pages
Module 4 Programmable Automation
No ratings yet
Module 4 Programmable Automation
14 pages
Introduction to Statistics and Data Analysis
No ratings yet
Introduction to Statistics and Data Analysis
23 pages
FortiSIEM Cloud Ordering Guide
No ratings yet
FortiSIEM Cloud Ordering Guide
11 pages
techFEST'23: Innovation in Agriculture
No ratings yet
techFEST'23: Innovation in Agriculture
11 pages
Marketing Plan Final
No ratings yet
Marketing Plan Final
48 pages

Unit 7

Uploaded by

Unit 7

Uploaded by

Unsupervised Learning for Clustering

Introduction and Philosophy

● Goal: To partition data into meaningful subgroups (clusters).

Common Distance Metrics (from "Pattern Recognition" by Singhal):

● Euclidean Distance: The straight-line distance between two points. Most

Real-World Applications for Postgraduates:

● From "Artificial Intelligence: A Modern Approach" (Russell & Norvig): K-means is

The Algorithm Step-by-Step

● K: The number of clusters (a hyperparameter you must choose).

● A set of K cluster centroids.

A Detailed Example: Clustering University Departments by Research Funding

Your Data (simplified):

Step 2: First Assignment Step

Step 3: First Update Step

● Recalculate centroids as the mean of the points in each cluster.

Critical Considerations for Postgraduates

● Choosing K (The Elbow Method): A major challenge. The algorithm requires K to

Common questions

Explain how the iterative process in the K-means algorithm ensures that clusters are refined over time.

How does K-means clustering differ from other forms of unsupervised learning in terms of its approach and the outcomes it produces?

What challenges do postgraduates face when applying K-means clustering, and how can they address these challenges effectively?

Discuss the role and significance of clustering in exploring natural groupings of data points in various real-world applications.

Why is K-means clustering sensitive to the initial selection of centroids, and how can this issue be mitigated?

Under what scenarios might document clustering be considered a particularly effective exploratory data analysis technique?

What is the main goal of clustering in unsupervised learning?

How are distance metrics used in clustering algorithms, and what are some common metrics applicable in different contexts?

In what ways can the assumption of spherical clusters affect the performance of K-means clustering?

How can the choice of the number of clusters (K) in K-means clustering impact the results, and what method can help determine an optimal K?

You might also like

Unit 7

Uploaded by

Unit 7

Uploaded by

Unsupervised Learning for Clustering

Introduction and Philosophy

●​ Goal: To partition data into meaningful subgroups (clusters).

Common Distance Metrics (from "Pattern Recognition" by Singhal):

●​ Euclidean Distance: The straight-line distance between two points. Most

Real-World Applications for Postgraduates:

●​ From "Artificial Intelligence: A Modern Approach" (Russell & Norvig): K-means is

The Algorithm Step-by-Step

●​ K: The number of clusters (a hyperparameter you must choose).

●​ A set of K cluster centroids.

A Detailed Example: Clustering University Departments by Research Funding

Your Data (simplified):

Step 2: First Assignment Step

Step 3: First Update Step

●​ Recalculate centroids as the mean of the points in each cluster.

Critical Considerations for Postgraduates

●​ Choosing K (The Elbow Method): A major challenge. The algorithm requires K to

Common questions

Explain how the iterative process in the K-means algorithm ensures that clusters are refined over time.

Explain how the iterative process in the K-means algorithm ensures that clusters are refined over time.

How does K-means clustering differ from other forms of unsupervised learning in terms of its approach and the outcomes it produces?

How does K-means clustering differ from other forms of unsupervised learning in terms of its approach and the outcomes it produces?

What challenges do postgraduates face when applying K-means clustering, and how can they address these challenges effectively?

What challenges do postgraduates face when applying K-means clustering, and how can they address these challenges effectively?

Discuss the role and significance of clustering in exploring natural groupings of data points in various real-world applications.

Discuss the role and significance of clustering in exploring natural groupings of data points in various real-world applications.

Why is K-means clustering sensitive to the initial selection of centroids, and how can this issue be mitigated?

Why is K-means clustering sensitive to the initial selection of centroids, and how can this issue be mitigated?

Under what scenarios might document clustering be considered a particularly effective exploratory data analysis technique?

Under what scenarios might document clustering be considered a particularly effective exploratory data analysis technique?

What is the main goal of clustering in unsupervised learning?

What is the main goal of clustering in unsupervised learning?

How are distance metrics used in clustering algorithms, and what are some common metrics applicable in different contexts?

How are distance metrics used in clustering algorithms, and what are some common metrics applicable in different contexts?

In what ways can the assumption of spherical clusters affect the performance of K-means clustering?

In what ways can the assumption of spherical clusters affect the performance of K-means clustering?

How can the choice of the number of clusters (K) in K-means clustering impact the results, and what method can help determine an optimal K?

How can the choice of the number of clusters (K) in K-means clustering impact the results, and what method can help determine an optimal K?

You might also like

● Goal: To partition data into meaningful subgroups (clusters).

● Euclidean Distance: The straight-line distance between two points. Most

● From "Artificial Intelligence: A Modern Approach" (Russell & Norvig): K-means is

● K: The number of clusters (a hyperparameter you must choose).

● A set of K cluster centroids.

● Recalculate centroids as the mean of the points in each cluster.

● Choosing K (The Elbow Method): A major challenge. The algorithm requires K to