Unsupervised Learning: Clustering Explained

Unsupervised learning is a machine learning approach where algorithms identify patterns in unlabelled data without prior guidance. Key techniques include clustering, such as K-means and hierarchical clustering, which group similar data points based on inherent characteristics. The process involves collecting unlabelled data, selecting an algorithm, training the model, grouping data, and interpreting the results for insights.

Uploaded by

disego9711

Unsupervised: clustering, association

Unsupervised learning is a type of machine learning where a computer tries to find patterns in data without
being told what the right answers are.
For example:
• In supervised learning, you teach the computer using examples like “this is a cat” and “this is a
dog.”
• In unsupervised learning, you just give the computer a bunch of pictures of animals — without
labels — and it tries to group similar ones together or find patterns on its own.
So basically, unsupervised learning = finding hidden patterns or groups in data without labels or
guidance.

The image shows a set of animals (elephants, camels and cows) that represents the raw data the
unsupervised learning algorithm will process.
• The "Interpretation" stage signifies that the algorithm has no predefined labels or categories
for the data. It must work out how to group or organize the data from its inherent patterns.
• The algorithm stage represents the unsupervised learning method, which can be clustering,
dimensionality reduction or anomaly detection.
• The processing stage shows the algorithm working on the data.

The output shows the results of the unsupervised learning process. In this case, the algorithm might have
grouped the animals into clusters by species (elephants, camels, cows).
Working of Unsupervised Learning
The working of unsupervised machine learning can be explained in these steps:
1. Collect Unlabelled Data
• Gather a dataset without predefined labels or categories.
• Example: Images of various animals without any tags.
2. Select an Algorithm
• Choose a suitable unsupervised algorithm, such as K-Means for clustering or Apriori for
association rule learning.
3. Train the Model on Raw Data
• Feed the entire unlabelled dataset to the algorithm.
• The algorithm looks for similarities, relationships or hidden structures within the data.
4. Group or Transform Data
• The algorithm organizes data into groups (clusters), rules or lower-dimensional forms without
human input.
• Example: It may group similar animals together or extract key patterns from large datasets.
5. Interpret and Use Results
• Analyze the discovered groups, rules or features to gain insights or use them for further tasks like
visualization, anomaly detection or as input for other models.
Unsupervised Learning Algorithms
Types of unsupervised learning algorithms
1. Clustering Algorithms
Clustering is an unsupervised machine learning technique that groups similar data points together
into clusters based on their characteristics, without using any labelled data. The objective is to
ensure that data points within the same cluster are more similar to each other than to those in
different clusters, enabling the discovery of natural groupings and hidden patterns in complex
datasets.
• Goal: Discover the natural grouping or structure in unlabelled data without predefined categories.
• How: Data points are assigned to clusters based on similarity or distance measures.
• Similarity Measures: Can include Euclidean distance, cosine similarity or other metrics depending
on data type and clustering method.
• Output: Each group is assigned a cluster ID, representing shared characteristics within the cluster.
• For example, if we have customer purchase data, clustering can group customers with similar
shopping habits. These clusters can then be used for targeted marketing, personalized
recommendations or customer segmentation.
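
As a minimal sketch of the similarity measures mentioned above, the following compares Euclidean distance and cosine similarity on two hypothetical customers. The customer vectors and the function names are illustrative assumptions, not data from the notes. It shows why the choice of measure matters: the two customers are far apart in Euclidean terms but point in exactly the same direction.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical customers: [groceries, electronics] spend per month.
small_spender = [10.0, 20.0]
big_spender = [100.0, 200.0]  # same proportions, ten times the volume

dist = euclidean_distance(small_spender, big_spender)
sim = cosine_similarity(small_spender, big_spender)
print(f"Euclidean distance: {dist:.2f}")
print(f"Cosine similarity:  {sim:.2f}")
```

Euclidean distance would place these customers in different clusters because of spending volume, while cosine similarity would treat them as alike because their spending mix is the same.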

The following image shows an example of how clustering works.

The left side of the image shows uncategorized data. On the right side, data has been grouped into clusters
that consist of similar attributes.

Types of clustering in unsupervised machine learning

The main types of clustering in unsupervised machine learning include:

• K-Means
• Hierarchical clustering
K-means
K-Means Clustering is an unsupervised machine learning algorithm that helps group data points into clusters
based on their inherent similarity. Unlike supervised learning, where we train models using labelled data, K-
Means is used when we have data that is not labelled and the goal is to uncover hidden patterns or structures.
For example, an online store can use K-Means to segment customers into groups like "Budget Shoppers,"
"Frequent Buyers," and "Big Spenders" based on their purchase history.

Working of K-Means Clustering


Suppose we are given a dataset of items, each described by a vector of feature values.
The task is to sort those items into groups using the K-means algorithm, where "k" is the
number of groups or clusters we want.
The algorithm places the items into "k" clusters of similar items. To measure similarity
we use the Euclidean distance: the smaller the distance between two items, the more similar they are.
The algorithm works as follows:
1. Choose the number of clusters (K):
o The letter “K” in K-Means means the number of groups you want to form.
o For example, if K=2, the algorithm will divide the data into 2 clusters.
2. Initialize centroids:
o Pick K random points from the dataset as the starting centers of the clusters.
o These points are called centroids.
3. Assign data points to the nearest centroid:
o Each data point is assigned to the cluster whose centroid is closest to it (based on distance,
usually Euclidean distance).
4. Update the centroids:
o For each cluster, calculate the average position of all points in that cluster.
o This new average point becomes the new centroid.
5. Repeat:
o Steps 3 and 4 are repeated until the centroids no longer move much, meaning the clusters
are now stable.
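
The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function name `kmeans`, the parameters `max_iters`, `tol` and `seed`, the rule for empty clusters and the sample points are all choices made for this example.

```python
import math
import random

def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 5: stop once no centroid moves more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Hypothetical 2-D data with two visibly separate groups (Step 1: K = 2).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster receives ID 0 and which receives ID 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the ID numbers.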
Euclidean Distance Formula
Consider two points (x1, y1) and (x2, y2) in a 2-dimensional space; the Euclidean distance between them is
given by the formula:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

Where,
• d is Euclidean Distance,
• (x1, y1) is the Coordinate of the first point,
• (x2, y2) is the Coordinate of the second point.
Example:
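
As a short worked illustration of the distance calculation, take the points (1, 2) and (4, 6). These values are chosen for this sketch and are not taken from the notes.

```latex
\begin{aligned}
d &= \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \\
  &= \sqrt{(4 - 1)^2 + (6 - 2)^2} \\
  &= \sqrt{9 + 16} = \sqrt{25} = 5
\end{aligned}
```

In K-means, this distance is computed from each data point to every centroid, and the point joins the cluster whose centroid gives the smallest value.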
Hierarchical Clustering in Machine Learning
Hierarchical clustering is an unsupervised learning technique used to group similar data points into clusters
by building a hierarchy (tree-like structure).
The algorithm builds clusters step by step either by progressively merging smaller clusters or by splitting a
large cluster into smaller ones. The process is often visualized using a dendrogram, which helps to
understand data similarity.
Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or groups of data merge
together. The bottom shows each data point as its own group and as we move up, similar groups are
combined. The lower the merge point, the more similar the groups are. It helps us see how things are
grouped step by step.

• At the bottom of the dendrogram the points A, B, C, D, E and F are all separate.
• As we move up, the closest points are merged into a single group.
• The lines connecting the points show how they are progressively merged based on similarity.
• The height at which they are connected shows how similar the points are to each other; the lower
the connection, the more similar they are.

Types of Hierarchical Clustering

1. Agglomerative Clustering
2. Divisive Clustering

Hierarchical Agglomerative Clustering (HAC) is a bottom-up clustering method.


1. Start with individual points:
Each data point begins as its own cluster.
2. Find the closest pair:
The algorithm finds the two clusters that are most similar (closest to each other).
3. Merge them:
Combine those two clusters into one bigger cluster.
4. Repeat:
Keep finding and merging the closest clusters again and again.
5. Finish:
Continue until only one big cluster remains that contains all the data.
6. Create a dendrogram: As the process continues, we can visualize the merging of clusters using a
tree-like diagram called a dendrogram. It shows the hierarchy of how clusters are merged.
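
The agglomerative steps above can be sketched in plain Python as follows. This is a minimal illustration that assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; the function names and the sample points are hypothetical. The returned merge list holds the information a dendrogram draws: which clusters merged and at what height.

```python
import math
from itertools import combinations

def single_link(cluster_a, cluster_b, points):
    """Distance between the closest pair of members (single linkage)."""
    return min(
        math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b
    )

def agglomerative(points):
    # Step 1: every point starts as its own cluster (stored as index tuples).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    # Steps 2-5: merge the closest pair until one cluster remains.
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        height = single_link(a, b, points)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        # Step 6: record the merge; this is what a dendrogram visualizes.
        merges.append((a, b, height))
    return merges

# Hypothetical 2-D points, referred to by index 0..4.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (12.0, 0.0)]
merges = agglomerative(points)
for a, b, height in merges:
    print(f"merge {a} and {b} at height {height:.2f}")
```

The merge heights never decrease from one step to the next, which is why a single-linkage dendrogram can be read from the bottom up as a sequence of increasingly dissimilar merges.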

Divisive clustering is the top-down counterpart of HAC. Workflow for hierarchical divisive clustering:


1. Start with all data points in one cluster: Treat the entire dataset as a single large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically done by
finding the two most dissimilar points in the cluster and using them to separate the data into two
parts.
3. Repeat the process: Choose the cluster whose points are most dissimilar and split it into two
smaller clusters in the same way.
4. Stop when each data point is in its own cluster: Continue this process until every data point is its
own cluster or the stopping condition (such as a predefined number of clusters) is met.
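
The divisive workflow above can be sketched in plain Python as follows. This is one simple interpretation under stated assumptions, not a standard library algorithm: the two most distant points of a cluster act as seeds, every other member joins the nearer seed, and the cluster with the largest internal distance is split next. The function names, the `n_clusters` stopping condition and the sample points are hypothetical.

```python
import math
from itertools import combinations

def farthest_pair(cluster):
    """Return the two most dissimilar (most distant) points in a cluster."""
    return max(combinations(cluster, 2), key=lambda pair: math.dist(*pair))

def diameter(cluster):
    """Largest pairwise distance inside a cluster (0 for a single point)."""
    if len(cluster) < 2:
        return 0.0
    return math.dist(*farthest_pair(cluster))

def divisive(points, n_clusters):
    # Step 1: start with every point in one cluster.
    clusters = [list(points)]
    # Step 4: stop at the requested number of clusters.
    while len(clusters) < n_clusters:
        # Step 3: pick the cluster whose members are most dissimilar.
        target = max(clusters, key=diameter)
        if diameter(target) == 0.0:
            break  # only single points (or exact duplicates) remain
        # Step 2: use the two most dissimilar points as seeds for the split.
        seed_a, seed_b = farthest_pair(target)
        left = [p for p in target if math.dist(p, seed_a) <= math.dist(p, seed_b)]
        right = [p for p in target if math.dist(p, seed_a) > math.dist(p, seed_b)]
        clusters.remove(target)
        clusters += [left, right]
    return clusters

# Hypothetical 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (12.0, 0.0)]
clusters = divisive(points, n_clusters=3)
print(clusters)
```

Setting `n_clusters` to the number of points continues the splitting until every point is its own cluster, which matches the first stopping rule in step 4.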
Use the distance matrix in Table 1 to perform single-link and complete-link hierarchical clustering. Show your
results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are
merged.

Combine P1 and P2
Distances after combining P1 and P2
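
The difference between single link and complete link can be sketched in plain Python as follows. The distance matrix here is hypothetical and is not the Table 1 of the exercise; it was chosen only so that the two methods produce different merge orders. The function name `linkage_merges` and the labels P1 to P5 are likewise assumptions for this illustration.

```python
from itertools import combinations

def linkage_merges(dist, labels, method):
    """Agglomerate using a precomputed distance matrix.

    method="single" takes the minimum pairwise distance between clusters,
    method="complete" takes the maximum.
    """
    pick = min if method == "single" else max

    def cluster_distance(pair):
        a, b = pair
        return pick(dist[i][j] for i in a for j in b)

    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # Always merge the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        name = "+".join(labels[i] for i in sorted(a + b))
        merges.append((name, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical symmetric distance matrix (NOT the exercise's Table 1).
labels = ["P1", "P2", "P3", "P4", "P5"]
dist = [
    [0, 2, 5, 9, 15],
    [2, 0, 3, 7, 13],
    [5, 3, 0, 4, 10],
    [9, 7, 4, 0, 6],
    [15, 13, 10, 6, 0],
]

single = linkage_merges(dist, labels, "single")
complete = linkage_merges(dist, labels, "complete")
print("single:  ", single)
print("complete:", complete)
```

With this matrix, both methods first combine P1 and P2. Single link then chains P3, P4 and P5 onto that cluster one at a time, while complete link pairs P3 with P4 before joining the two pairs, so the resulting dendrograms have different shapes and merge heights.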
