0% found this document useful (0 votes)
5 views5 pages

Unit 7

unit 7

Uploaded by

nv1242326
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views5 pages

Unit 7

unit 7

Uploaded by

nv1242326
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Unsupervised Learning for Clustering

1. Basics of Clustering

Introduction and Philosophy

Clustering is the task of grouping a set of objects in such a way that objects in the
same group (called a cluster) are more similar to each other than to those in other
groups. It is a form of unsupervised learning because the algorithm is not provided
with ground-truth labels; it must find the structure on its own.

●​ From "Machine Learning" (Tom Mitchell): "In unsupervised learning, the learner
is given a set of examples without any labels... The goal might be to create a
classification for the objects."
●​ From "Pattern Recognition" (Rajjan Singhal): Clustering is an "exploratory data
analysis" technique to uncover the "natural grouping" in a set of patterns, points,
or objects.

Key Concepts

●​ Goal: To partition data into meaningful subgroups (clusters).


●​ No Labels: Unlike supervised learning, there are no target outputs or class labels
provided during training.
●​ Similarity/Dissimilarity: The entire process hinges on a measure of similarity.
Data points that are "close" according to a chosen distance metric (like Euclidean
distance) are grouped together.

Common Distance Metrics (from "Pattern Recognition" by Singhal):

●​ Euclidean Distance: The straight-line distance between two points. Most


common for continuous features.
●​ Manhattan Distance: The sum of absolute differences along each dimension.
●​ Cosine Similarity: Measures the cosine of the angle between two vectors.
Excellent for text data or high-dimensional data where magnitude is less
important than orientation.

Real-World Applications for Postgraduates:


●​ Customer Segmentation: Grouping customers based on purchasing behavior for
targeted marketing.
●​ Image Segmentation: Partitioning an image into regions of similar pixels for
computer vision.
●​ Document Clustering: Grouping news articles or research papers by topic without
knowing the categories in advance.
●​ Genetics: Clustering gene expression data to discover groups of genes with
similar functions.

2. K-Means Clustering

This is one of the most popular and simple clustering algorithms. It is an iterative,
centroid-based algorithm.

●​ From "Artificial Intelligence: A Modern Approach" (Russell & Norvig): K-means is


a method for "vector quantization" that aims to partition n observations into k
clusters.

The Algorithm Step-by-Step

The goal is to partition the data into K clusters, where each data point belongs to the
cluster with the nearest mean (cluster center or centroid).

Input:

●​ K: The number of clusters (a hyperparameter you must choose).


●​ Dataset: {x¹, x², ..., xⁿ} where each x is a d-dimensional data point.

Output:

●​ A set of K cluster centroids.


●​ A label for each data point assigning it to one of the K clusters.

The Procedure (as detailed in "Machine Learning" by Tom Mitchell and "Artificial
Intelligence and Machine Learning" by Vinod Chandra & Anand):

1.​ Initialization:
○​ Randomly select K data points from the dataset to serve as the initial
cluster centroids, {μ₁, μ₂, ..., μ }.
2.​ The Iteration Loop (Repeat until convergence):
○​ Assignment Step (Expectation):
■​ For each data point xⁱ in the dataset:
■​ Calculate the distance between xⁱ and each centroid μⱼ
(typically using Euclidean distance).
■​ Assign xⁱ to the cluster whose centroid is the closest.
■​ Formally, assign each point to the cluster j where j = argminⱼ
||xⁱ - μⱼ||².
○​ Update Step (Maximization):
■​ For each cluster j, recalculate its centroid μⱼ as the mean
(average) of all data points currently assigned to that cluster.
■​ Formally, μⱼ = (1 / |Cⱼ|) * Σ xⁱ for all xⁱ in cluster Cⱼ.
3.​ Convergence:
○​ The algorithm is considered to have converged when the centroids no
longer change significantly between iterations, or when the cluster
assignments stabilize.

A Detailed Example: Clustering University Departments by Research Funding

Imagine you are a university administrator and want to group departments based on
two metrics: Annual Research Funding (in millions) and Number of PhD
Graduates per year. You decide to use K-means with K=3 to find low, medium, and
high-performing clusters.

Your Data (simplified):

●​ Dept A: (1.0, 2)
●​ Dept B: (1.5, 3)
●​ Dept C: (3.0, 5)
●​ Dept D: (5.0, 7)
●​ Dept E: (3.5, 5)
●​ Dept F: (4.5, 6)
●​ Dept G: (3.5, 4)

Step 1: Initialization

●​ Let's randomly choose K=3 initial centroids: μ₁ = A(1.0, 2), μ₂ = C(3.0, 5),
μ₃ = F(4.5, 6).

Step 2: First Assignment Step

●​ Calculate distances of each point to all three centroids and assign it to the
closest one.
○​ Point B(1.5, 3):
■​ Dist to μ₁: √((1.5-1.0)² + (3-2)²) = √(0.25+1) ≈ 1.12
■​ Dist to μ₂: √((1.5-3.0)² + (3-5)²) = √(2.25+4) ≈ 2.5
■​ Dist to μ₃: √((1.5-4.5)² + (3-6)²) = √(9+9) ≈ 4.24
■​ Assignment: Cluster 1 (closest to μ₁).
●​ Repeat for all points. The initial clusters might be:
○​ Cluster 1 (μ₁): A, B
○​ Cluster 2 (μ₂): C, E, G
○​ Cluster 3 (μ₃): D, F

Step 3: First Update Step

●​ Recalculate centroids as the mean of the points in each cluster.


○​ New μ₁: Mean of A and B = ((1.0+1.5)/2, (2+3)/2) = (1.25, 2.5)
○​ New μ₂: Mean of C, E, G = ((3.0+3.5+3.5)/3, (5+5+4)/3) = (3.33,
4.67)
○​ New μ₃: Mean of D and F = ((5.0+4.5)/2, (7+6)/2) = (4.75, 6.5)

Iteration 2:

●​ Assignment Step: Reassign all points using the new centroids (1.25, 2.5),
(3.33, 4.67), (4.75, 6.5).
●​ You will likely find that point G(3.5, 4) is now closer to the new μ₂ than to μ₃.
●​ New Clusters: Cluster 1: {A, B}; Cluster 2: {C, E, G}; Cluster 3: {D, F}
●​ Update Step: Recalculate centroids again. They will shift slightly.

The algorithm continues until the centroids and cluster assignments stop changing.
The final result might be:

●​ Cluster 1 (Low): A, B
●​ Cluster 2 (Medium): C, E, G
●​ Cluster 3 (High): D, F

Critical Considerations for Postgraduates

●​ Choosing K (The Elbow Method): A major challenge. The algorithm requires K to


be specified in advance. A common heuristic is the Elbow Method (from
"Machine Learning" by Anuradha & Vincy): Plot the sum of squared distances
from points to their assigned centroids against different values of K. The "elbow"
of the curve, where the rate of decrease sharply changes, is often a good choice
for K.
●​ Sensitivity to Initialization: The random initial centroid selection can lead to
sub-optimal results. Common solutions include running the algorithm multiple
times (K-means++ is a smarter initialization method discussed in advanced
texts).
●​ Assumes Spherical Clusters: K-means works best when clusters are spherical
and roughly equal in size. It performs poorly with complex, non-globular
shapes.

Common questions

Powered by AI

The iterative process in the K-means algorithm refines clusters over time through repeated execution of the Assignment and Update steps until convergence. During the Assignment step, each data point is assigned to the cluster with the closest centroid. In the Update step, centroids are recalculated as the mean of all points in the cluster. By iterating these steps, the algorithm ensures that data points are grouped into clusters of increasing internal similarity, refining the clusters until the centroids and assignments no longer change significantly .

K-means clustering differs from other forms of unsupervised learning primarily in its objective of partitioning data into a specified number of clusters based on similarity, represented by the distance to cluster centroids. Unlike other methods that may not require pre-defined group numbers or might employ different principles, such as dimensionality reduction techniques, K-means focuses on creating clearly separated groups with centroid-based optimization, producing clusters with defined center points and direct assignments of data points to clusters based on proximity .

Postgraduates face several challenges when applying K-means clustering, such as selecting the number of clusters (K), dealing with sensitivity to initial centroids, and assuming spherical cluster shapes, which can lead to poor results with non-globular data. They can address these challenges by using methods like the Elbow Method for determining K, employing multiple runs and smart initialization (e.g., K-means++) to mitigate sensitivity to initial centroids, and applying alternative clustering algorithms that accommodate diverse data shapes and distributions .

Clustering plays a crucial role in exploring natural groupings of data points, facilitating insights in various real-world applications. For instance, in customer segmentation, it groups customers based on purchasing behavior, aiding targeted marketing strategies. In image segmentation, it divides an image into regions of similar pixels, benefiting computer vision tasks. Document clustering organizes articles by topic without predefined categories, supporting efficient information retrieval. In genetics, it groups gene expression data to discover genes with similar functions, advancing biological research .

K-means clustering is sensitive to the initial selection of centroids because a poor choice can lead to suboptimal clustering results, as the algorithm might converge to local minima rather than the global one. This issue can be mitigated by running the algorithm multiple times with different initializations and by using smarter initialization methods such as K-means++, which selects initial centroids in a way that is likely to result in better performance .

Document clustering is particularly effective in scenarios involving large, unstructured datasets where category labels are not predefined, such as organizing news articles, research papers, or social media content by underlying topics. It helps in summarizing information, detecting thematic structures, and facilitating content recommendations or search enhancements. Its effectiveness is heightened where the aim is to explore high-dimensional text data to discover hidden patterns independently of human categorization .

The main goal of clustering in unsupervised learning is to group a set of objects in such a way that objects in the same group, or cluster, are more similar to each other than to those in other groups. It involves partitioning data into meaningful subgroups without any labels, focusing on similarity or dissimilarity measures among data points .

Distance metrics are used in clustering algorithms to quantify the similarity or dissimilarity between data points, which is crucial for grouping similar objects into clusters. Common metrics include Euclidean distance, suitable for continuous features; Manhattan distance, which is the sum of absolute differences along each dimension; and Cosine similarity, ideal for text data or high-dimensional data where the magnitude is less important than orientation .

The assumption of spherical clusters can significantly affect the performance of K-means clustering because it is best suited for datasets where the clusters are spherical and of roughly equal size. This assumption limits K-means in handling clusters of arbitrary shapes or varying sizes, often resulting in poor performance when clusters are non-globular or when the data contains outliers .

The choice of the number of clusters (K) in K-means clustering significantly impacts the results as it determines how the data is grouped. Selecting an inappropriate K can lead to overfitting or underfitting of the data. The Elbow Method is a heuristic used to determine an optimal K, by plotting the sum of squared distances from points to their assigned centroids against different K values. The 'elbow' point, where the rate of decrease sharply changes, is often considered a good choice for K .

You might also like