0% found this document useful (0 votes)
5 views60 pages

Unsupervised Learning: Clustering Techniques

Chapter 3 discusses unsupervised learning, focusing on clustering techniques that identify internal structures within unlabeled data. It covers the importance of clustering in various applications, the evaluation of clustering quality through metrics like intra-cluster cohesion and inter-cluster separation, and the K-means algorithm as a popular clustering method. Additionally, it introduces hierarchical clustering as an alternative approach that does not require a predefined number of clusters.

Uploaded by

behaylu tadele
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views60 pages

Unsupervised Learning: Clustering Techniques

Chapter 3 discusses unsupervised learning, focusing on clustering techniques that identify internal structures within unlabeled data. It covers the importance of clustering in various applications, the evaluation of clustering quality through metrics like intra-cluster cohesion and inter-cluster separation, and the K-means algorithm as a popular clustering method. Additionally, it introduces hierarchical clustering as an alternative approach that does not require a predefined number of clusters.

Uploaded by

behaylu tadele
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chapter-3 Unsupervised Learning

By: Gizachew M.
Unsupervised learning

• Unsupervised learning aims to find the underlying structure or the distribution of data.
• We want to explore the data to find some internal structures in them.

• Learn from inputs x 1 , . . . , x n ∈ Rd without any labels y1 , . . . , yn.


*No predefined classes - Clustering
How it Work?
How it Work?

• Unsupervised learning are of growing importance in a number of fields:


• Subgroups of breast cancer patients grouped by their gene expression measurements,
• Groups of shoppers characterized by their browsing and purchase histories,
• Movies grouped by the ratings assigned by movie viewers.
Type of Unsupervised Machine Learning Algorithm

• Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a
data set that determine the intrinsic grouping in a set of unlabeled data.

Cluster: a collection of data objects

• Similar to one another within the same cluster

• Dissimilar to the objects in other clusters

Cluster analysis Finding similarities between data according to the characteristics found in
the data and grouping similar data objects into clusters
What is Cluster Analysis?

 Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in other
groups
Inter-cluster
Intra-cluster distances are
distances are maximized
minimized
• Inter-cluster distance: This is the
• It measures how close distance between different clusters.
the data points are to •Maximized: The algorithm aims to
each other inside the make the clusters as far apart from
same group. each other as possible.
Clustering-Applications
Example: Clustering
⚫ The example below demonstrates the clustering of padlocks of same kind.

There are a total of 10 padlocks which various in color, size, shape, etc.

⚫ How many possible clusters of padlocks can be identified?


⚫ There are three different kind of padlocks; which can be grouped into three

different clusters.

⚫ The padlocks of same kind are clustered into a group as shown below.

7
Examples of Clustering Applications
⚫ Marketing: Help marketers discover distinct groups in their customer bases, and the use
this knowledge to develop targeted marketing programs

⚫ Land use: Identification of areas of similar land use in an earth observation database

⚫ Insurance: Identifying groups of motor insurance policy holders with a high average claim
cost
⚫ City-planning: Identifying groups of houses according to their house type, value, and
geographical location

⚫ Earth-quake studies: Observed earth quake epicenters should be clustered along


continent faults
What is Good Clustering

A good clustering method will produce high quality clusters with

• high intra-class similarity

• low inter-class similarity

• The quality of a clustering result depends on both the similarity measure

used by the method and its implementation.


Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or


dissimilarity between two data objects
• Distance measure greatly depends on the type of the attribute
value (binary, nominal, ordinal, etc) that each object will
assume.
Procedures of Cluster Analysis

• A typical cluster analysis consists of four steps


Overview of clustering
 Feature Selection
 Identifying the most effective subset of the original features to use in
clustering
 Feature Extraction
 Transformations of the input features to produce new salient features.
 Feature extraction can be elaborated in the context of dimensionality
reduction and data visualization
 Inter-pattern Similarity: Measured by a distance function defined on pairs of
patterns.
 Grouping: Methods to group similar patterns in the same cluster
Outliers

• Outliers are objects that do not belong to any cluster or form


clusters of very small cardinality

cluster

outliers

•In some applications we are interested in discovering outliers, not


clusters (outlier analysis)
Evaluation
⚫ Intra-cluster cohesion (compactness):

• This measures how compact or tightly packed the data points are within each cluster.
• A good cluster has high cohesion, meaning its members are very similar to each other.
• Metrics: Within-Cluster Sum of Squares (WCSS), Average Intra-cluster Distance.
• WCSS measures the sum of the squared distances between each point in a cluster and its cluster
centroid
• A low WCSS means that, on average, the distances from points to their centroid are small. This
directly confirms your observation: "Points within each cluster are close to the cluster centroid.“
• A high WCSS means that points are, on average, far from their centroid, indicating poor, spread-out
clusters.
• Observation: "Points within each cluster are close to the cluster centroid (e.g., the mean point).“
• Goal: Minimize these distances.
Evaluation

Average Intra-cluster Distance


• This metric provides a more intuitive, direct measure of cluster "tightness.“
• It calculates the average of all the pairwise distances between points within the same cluster
• A low Average Intra-cluster Distance means that points within a cluster are, on average, very
close to each other.
• If points are close to each other, they are also necessarily close to a central point like the
centroid. This strongly supports your observation.
• A high value indicates a loose, dispersed cluster.
Primary Use: It is often used to evaluate and compare the quality of different clustering results
because it is easy to understand: "On average, how far apart are the points in this cluster?"
Evaluation
Inter-cluster separation (isolation):

• This measures how well-separated and distinct the different clusters are from one another.

• A good clustering has high separation, meaning the clusters are clearly defined and don't
blend into each other.
Metrics: Between-Cluster Sum of Squares (BCSS), Silhouette Score, Davies-Bouldin Index.
Observation: "Cluster centroids are far apart from each other."
Goal: Maximize these distances.
• There is a fundamental trade-off between cohesion and separation. The ultimate goal of
clustering is to find a result that optimally balances both, creating tight, distinct groups
• In most applications, expert judgments are still the key.
Internal Index: Measuring clustering validity
•This method evaluates the clustering structure based solely on the data itself,
without any external information or "correct answers.“

•Validate without external info:


• You don't need to know the "right" answer. You are judging the quality based on the intrinsic properti
of the clusters you created.
•With different number of clusters:
•This is the primary use case for internal indices.
•You run your clustering algorithm (like K-Means) multiple times with a different number of cluste
(k=2, k=3, k=4, etc.).
•For each result, you calculate an internal index score.
•Solve the number of clusters
• The "best" number of clusters (k) is often chosen as the one that gives the optimal (highest or lowe
internal index score.
• For example, you pick the k that has the highest Silhouette Score.
Quiz
Q1. What is the difference between supervised and unsupervised machine learning algorithm and
list two or more example for each? 2 point
Q2. what are the evaluation methods for clustering algorithm in unsupervised machine learning
algorithm? 1 point
Q3. what is Clustering algorithm and show the Procedures of Cluster Analysis 2 point
Measuring clustering validity

External Index (Reality)


• This method evaluates the clustering by comparing it to a ground truth or "reality" that you already
know.
• Validate against ground truth:
• You have pre-defined, known labels for your data (e.g., you know which customers actually
churned, or you have hand-labeled categories for documents).
• The clusters are your algorithm's guess, and the ground truth is the correct answer.
• Compare two clusters: (how similar):
• This involves comparing the clustering result you got with the known, true labels.
• It measures how similar they are.
Measuring clustering validity

External Index (Reality)


Common External Indices:
• Adjusted Rand Index (ARI):
• Measures the similarity between two clusterings (yours and the ground truth), corrected for
chance.
• Normalized Mutual Information (NMI):
• Measures the mutual information between the cluster assignments and the true labels,
normalized.
• Fowlkes-Mallows Index (FMI):
• The geometric mean of precision and recall for pairs of points.
Measuring clustering validity
Aspect Internal Index External Index

Judge quality by the inherent Judge quality by comparison to a


Core Idea
structure of the clusters. known truth.
Requires Ground Truth? No Yes
Finding the optimal number of Benchmarking algorithms or
Primary Use Case clusters when you don't know the validating if clusters match a
true categories. known, pre-existing classification.
Cohesion (intra-cluster) vs. Similarity between the clustering
What it Measures
Separation (inter-cluster). and the true labels.
Adjusted Rand Index (ARI),
Silhouette Score, Davies-Bouldin
Example Metrics Normalized Mutual Information
Index
(NMI)
Categories of clustering Algorithms
Partitioning method

• The fundamental idea is to divide a dataset (D) of n objects into exactly k clusters, where k is a
number chosen by the user.
Key Characteristic:
Each object must belong to exactly one cluster (hard clustering), and every cluster must have at least
one object.

 Various approaches have been proposed and some of them are:

 K-means Approach
 K-medoid Approach, CLARA (Custering LARge Application) and CLARANS
(CLustering Algorithm based on RANdomized Search)
K-means clustering

• K-means is a partitional clustering algorithm


• Divides data into non-overlapping clusters
• Each data point belongs to exactly one cluster
• No hierarchical structure between clusters
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ∈Rr, and r is the
number of attributes (dimensions) in the data.

• The k-means algorithm partitions the given data into k clusters(Key Parameters).

• Each cluster has a cluster center, called centroid.

• k is specified by the user


K-means Application
•A pizza chain wants to open its delivery centers across a city.
•What do you think would be the possible challenges?
✓ They need to analyze the areas from where the pizza is being ordered frequently.
✓ They need to understand as to how many pizza stores have to be opened to cover delivery in the area.
✓ They need to figure out the locations for the pizza stores within all these areas in order to keep the
distance between the store and delivery points minimum.
K-means Application
1. Insurance Fraud Detection
• We make use of past data and also we draw various patterns to make sure that if any
new case comes
• it would definitely incline towards any specific cluster that the algorithms have created.
• It’s an important thing to do as it can really help us to reduce insurance fraud.
2. Search Engines
• The algorithm helps to form various clusters of possible results.
• If any search happens using the search engine, the result will be from a particular
cluster, as search engines have a huge collection of clusters.
• The query would incline to a particular one prompting the algorithm to give out the
accurate result.
3. Customer Segmentation
• Companies find this very beneficial for improving their customer base.
• They would generally create various clusters of types of customers and would
specifically target a few of them to improve their campaign.
K-means Application
4. Crime Hot-Spot Detection
• This is a useful one for any city’s police department.
• Using this model, they would identify specific areas with a high frequency of crimes and can
gain a lot of information from this.
5. Diagnostic Systems
• In the medical profession, professionals deal with various cases of ailments, both severe and
normal.
• This clustering algorithm can help them in various ways like, detect ailments by taking in data of
symptoms.
• They can also help in being smart support systems and help in making better decisions.
K-means algorithm

 Given k, the k-means algorithm is implemented in 4 steps:


Step 1: Select the number K to decide the number of clusters.
Step 2: Select random K points or centroids. (They can be from the input dataset).
Step 3: Assign each data point to its closest centroid, which will form the
predefined KK clusters.
Step 4: Calculate the variance and place a new centroid for each cluster.
Step 5: Repeat the third step; reassign each data point to the new closest centroid of
each cluster.
Step 6: If any reassignment occurs, then go to Step 4. Otherwise, go to FINISH.
Step 7: The model is ready.
Strengths of k-means

• Strengths:
• Simple: easy to understand and to implement
• Efficient: Time complexity: O(tkn), where n is the number of data points, k is the number of
clusters, and t is the number of iterations.

• Since both k and t are small. k-means is considered a linear algorithm.

• K-means is the most popular clustering algorithm.


Weaknesses of k-means

• The algorithm is only applicable if the mean is defined.


• For categorical data, k-mode - the centroid is represented by most frequent
values.

• The user needs to specify k.

• The algorithm is sensitive to outliers


• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points with
very different values.
Example
Initial Data and Centroids:

Student
Height Weight Cluster
No.
1 185 72 K1
2 170 56 K2
3 168 60 K2
4 179 68 K1
5 182 72 ?
6 188 77 ?
7 180 71 ?
8 180 70 ?
9 183 84 ?
10 180 88 ?
11 180 67 ?
12 177 76 ?

Step 1: Decide Initial Centroids


•K1: (185, 72)
•K2: (170, 56)
Example
Initial Data and Centroids:
Student
Height Weight Cluster
No.
1 185 72 K1
2 170 56 K2
3 168 60 K2
4 179 68 K1
5 182 72 ?
6 188 77 ?
7 180 71 ?
8 180 70 ?
9 183 84 ?
10 180 88 ?
11 180 67 ?
12 177 76 ?
Hierarchical Clustering
•Hierarchical clustering is another unsupervised machine learning algorithm.
•It is used to group unlabeled datasets into clusters and is also known as Hierarchical Cluster
Analysis (HCA).
•In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as a dendrogram, , which visually represents the nested clusters.
Why hierarchical clustering?
•As we already have the K-Means Clustering Algorithm,
•We have seen that there are some challenges with this algorithm.
•These are a predetermined number of clusters, and it always tries to create
clusters of the same size.
•To solve these two challenges, we can opt for the hierarchical clustering
algorithm, because in this algorithm, we don't need to have knowledge about the
predefined number of clusters.
Types of hierarchical clustering

• Agglomerative(bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and

• Merges the most similar (or nearest) pair of clusters

• Clusters are repeatedly merged based on their similarity until all data points belong to a single
cluster.

• Stops when all the data points are merged into a single cluster (i.e., the root cluster).

• Divisive (top down) clustering: It starts with all data points in one cluster, the root.

• Splits the root into a set of child clusters. Each child cluster is recursively divided further

• Clusters are recursively split into smaller clusters until each data point is in its own cluster.
• Stops when only singleton clusters of individual data points remain, i.e., each cluster with
only a single point
Types of hierarchical clustering
Step Step Step Step Step
0 1 2 3 4 Agglomerative Nesting
a ab (AGNES)
b abcde
c cde
d
de
e
a
ab
b abcde
c Divisive Analysis
cde
d (DIANA)
de
e
Step Step Step Step Step
4 3 2 1 0
Agglomerative clustering

It is more popular than divisive methods.

• At the beginning, each data point forms a cluster (also called a node).

• Merge nodes/clusters that have the least distance.

• Go on merging

• Eventually all nodes belong to one cluster


Agglomerative clustering algorithm
Measuring the distance of two clusters
Hierarchical Clustering: Problems and Limitations
 Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling different sized clusters and convex shapes
Breaking large clusters
Association Rule Learning
• Association rule mining is a technique used to uncover hidden relationships between variables in
large datasets.
• It is a popular method in data mining and machine learning
• Has a wide range of applications in various fields, such as market basket analysis, customer
segmentation, and fraud detection.
• For example, consider a dataset of transactions at a grocery store.
• Association rule mining could be used to identify relationships between items that are frequently
purchased together.
• For example, the rule "If a customer buys bread, they are also likely to buy milk" is an association
rule that could be mined from this data set.
• We can use such rules to inform decisions about store layout, product placement, and marketing
efforts.
Association Rule Learning
• The resulting rules are often expressed in the form of "if-then" statements,
• Where the "if" part represents the antecedent (the condition being tested) and the
"then" part represents the consequent (the outcome that occurs if the condition is
met).
Reinforcement Learning
• Reinforcement learning is about learning to make decisions.
• Imagine an agent, which could be anything from a software program to a robot,
navigating an environment.
• This environment could be a physical space, a virtual game world, or even a market.
• The agent takes actions within this environment, and those actions can lead to certain
outcomes
Reinforcement Learning

• The goal of the agent is to earn the most rewards possible over time.
• It does this by learning a policy, which is essentially a strategy that tells it what action to
take in any given situation.
• This policy is refined over many iterations of interacting with the environment.
• To illustrate, consider a chess-playing AI.
• The agent's actions are the moves it makes on the chessboard.
• The environment is the current state of the game, and the reward is winning the game.
• Through repeated play and feedback on its moves:
• The RL agent learns which actions are more likely to lead to victory.
Reinforcement learning

▪ Reinforcement learning is the problem faced by an agent that learns behavior through trial-
and-error interactions with a dynamic environment.
▪ Reinforcement Learning is learning how to act in order to maximize a numerical reward.

▪ RL is not a type of neural network, nor is it an alternative to neural networks. Rather, it is


an orthogonal approach for Learning Machine(a design philosophy for building robust,
interpretable, and efficient learning systems)

▪ It emphasizes learning feedback that evaluates the learner's performancewithout providing


standards of correctness in the form of behavioral targets.

Training Info = evaluations (“rewards” / “penalties”)


Inputs RL Outputs (“actions”)
System

Objective: get as much reward as possible


The phenomenon of reinforcement learning
▪ No knowledge of environment
▪ Can only act in the world and observe states and reward

observation action

Ot At

reward Rt

▪ At each step t the agent:


▪ Executes action At
▪ Receives observation Ot
▪ Receives scalar reward Rt
▪ The environment:
▪ Receives action At
▪ Emits observation Ot+1
▪ Emits scalar reward Rt+1
▪ t increment at env. step
Reinforcement learning
▪ Elements of reinforcement learning • The "intelligent program" or learner that is being trained.
▪ Temporally situated • The player in a game, the decision-making AI in a self-driving car, or a
robot learning a task.
▪ Continual learning and planning •To discover which actions yield the most cumulative reward over time.
▪ Object is to affect the environment What it is: A move or decision made
▪ Environment is stochastic and uncertain Policy
by the agent that changes the state
of the environment.
What it is: A snapshot of the situation or •Analogy: Moving a chess piece,
current configuration of the environment
State st Agent turning the steering wheel, or a robot
at a specific time. Action a t moving its arm.
•Analogy: The current positions of all Reward rt •In the Loop: Based on state s₀,
pieces on a chessboard, the current
sensor readings of a robot, or the current Next state s t+1 the agent executes action a₀.
frame of a video game.
•In the Loop: The agent observes
Environment The Reward (r)
state s₀, takes an action, and the • A immediate, numerical feedback signal from the environment that tells the
environment transitions to a new agent how good or bad its last action was.
state s₁. • Winning points in a game, a penalty for crashing, or a positive signal for
completing a step correctly.
•The external world or system with which the agent interacts. • After action a₀, the environment provides reward r₀.
•It is everything the agent can perceive and affect. The Policy (π)
•The game board, the road for a self-driving car, or the physical • The agent's core strategy or "brain." It defines the agent's behavior by
workspace for a robot mapping perceived states to the actions to be taken.
•Analogy: It's the player's strategy or set of rules (e.g., "If my opponent
does X, I should do Y").
•How it's Stored: It can be a simple lookup table or a complex function (like
a neural network).
Reinforcement learning
▪ Elements of reinforcement learning

Policy

Reward
Value Model of
environment • Policy: what to do
• Reward: what is good
• Value: what is good because it
predicts reward
▪ Reward function: • Model: what follows what
▪ Defines the goal in an RL problem
▪ Policy is altered to achieve this goal
▪ The task is to learn an optimal policy that maps states of the world to actions of the agent. I.e., if
this patch of room is dirty, I clean it. If my battery is empty, I
recharge it.
Markov Decision Process
▪ Markov Decision Process (MDP) is a way to describe how a decision-making agent like a robot or
game character moves through different situations while trying to achieve a goal.

▪ It meant to be a straightforward framing of the problem of learning from interaction to


achieve a goal.

▪ MDPs rely on variables such as the environment, agent’s actions and rewards to decide the
system’s next optimal action. It helps us answer questions like:
• What actions should the agent take?
• What happens after an action?
• Is the result good or bad.
Markov Decision Process
▪ At time step t=0, environment samples initial state s0 ~ p(s0)

▪ Then, for t=0 until done:

▪ Agent selects action at

▪ Environment samples reward rt ~ R( . | st, at)

▪ Environment samples next state st+1 ~ P( . | st, at)

▪ Agent receives reward rt and next state st+1

▪ A policy u is a function from S to A that specifies what action to take in each


state

▪ Objective: find policy u* that maximizes cumulative discounted


reward:
Markov Decision Process
▪ A simple MDP: Grid World
states
actions = {

1. right
Set a negative “reward” for each

2. left transition (e.g. r = -1)
3. up
4. Down }
▪ Objective: reach one of terminal states (greyed out) least number of in
actions
▪ Example: Recycle robot MDP
▪ At each step, robot has to decide whether it should (1) actively search for a can,
(2) wait for someone to bring it a can, or (3) go to home base and recharge.
▪ Searching is better but runs down the battery; if runs out of power while searching, has to be
rescued (which is bad).
Markov Decision Process
▪ Example: Recycle robot MDP
▪ Decisions made on basis of current energy level: high, low.
▪ Reward = number of cans collected
Rsearch = expected no. of cans while searching Rwait =
S = {high,low} expected no. of cans while waiting
A( high))=={{search,wait,recharge
A( low search,wait} }
R >R
search wait
Dimensionality reduction techniques

• Another unsupervised learning setting, e.g PCA (Prinicipal component analsysis)


• Dimensionality reduction is simply, the process of reducing the dimension of your
feature set.
• Your feature set could be a dataset with a hundred columns (i.e. features).
Dimensionality reduction techniques

Dimensionality reduction
• Represent the data x1 , . . . , xn ∈ Rd in a subspace of lower dimension
with as little loss of information as possible
• Advantages:
• Visualization
• Lower computation and time complexity
• Avoid overfitting and reduce noise
Dimensionality reduction techniques

• Problem settings:
• Given x 1 , . . . , x n ∈ Rd
• find a k -dimensional subspace
• such that the data projected onto that space
• is as close to the original data as possible

• Common techniques:
• Principal Component Analysis(PCA)
• FactorAnalysis
• Linear Discriminant Analysis( LDA) please read the details of each
• Multiple Correspondence Analysis (MCA)
• t-Distributed Stochastic Neighbour Embedding (t-SNE)
Dimensionality reduction techniques

• The most common and well known dimensionality reduction (PCA,


LDA, FA)
• Principal Component Analysis(PCA): PCA rotates and projects data along
the direction of increasing variance. The features with the maximum variance
are the principal components.
Dimensionality reduction techniques

• Linear Discriminant Analysis( LDA): projects data in a way that the


class separability is maximized. Examples from same class are put
closely together by the projection. Examples from different classes are
placed far apart by the projection.
Dimensionality reduction techniques

• Factor Analysis: a technique that is used to reduce a large number of


variables into fewer numbers of factors.
• The values of observed data are expressed as functions of a number of possible
causes in order to find which are the most important.

• The observations are assumed to be caused by a linear transformation of lower


dimensional latent factors and added Gaussian noise.
Thank you

You might also like