
Clustering in Python: Unsupervised Learning

Unsupervised machine learning algorithms like clustering do not have supervision or guidance. Clustering groups similar data points together by finding patterns in unlabeled data. Common clustering algorithms include K-means and Mean Shift. K-means assumes the number of clusters and iteratively assigns data points to centroids until clusters are optimized. Mean Shift does not assume cluster numbers and moves centroids toward areas of higher density until convergence. Clustering performance can be evaluated using silhouette analysis, which calculates how closely data points match their own cluster compared to neighboring clusters.

7/31/22, 8:00 PM AI with Python - Unsupervised Learning: Clustering

AI with Python - Unsupervised Learning: Clustering

Unsupervised machine learning algorithms have no supervisor to provide guidance. That is why
some consider them closely aligned with true artificial intelligence.

In unsupervised learning, there is no correct answer and no teacher to guide the algorithm.
The algorithm has to discover interesting patterns in the data on its own.

What is Clustering?
Clustering is an unsupervised learning method and a common technique for statistical data
analysis used in many fields. It divides a set of observations into subsets, called clusters,
in such a way that observations in the same cluster are similar in some sense and dissimilar
to the observations in other clusters. In simple words, the main goal of clustering is to
group the data on the basis of similarity and dissimilarity.

For example, the following diagram shows similar kind of data in different clusters −

Algorithms for Clustering the Data


Following are a few common algorithms for clustering the data −

K-Means algorithm
The K-means clustering algorithm is one of the best-known algorithms for clustering data. It
assumes that the number of clusters is already known. This is also called flat clustering. It is an
iterative clustering algorithm. The steps given below need to be followed for this algorithm −

Step 1 − We need to specify the desired number of K subgroups.

Step 2 − Fix the number of clusters and randomly assign each data point to a cluster. In other
words, we need to classify our data based on the number of clusters.

Step 3 − In this step, cluster centroids should be computed.

Step 4 − As this is an iterative algorithm, we need to update the locations of the K centroids with
every iteration until they stop moving, that is, until the centroids reach a locally optimal position.
K-means is not guaranteed to find the global optimum.
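The steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the Scikit-learn implementation; the function name `kmeans`, the toy points, and the rule for re-seeding an empty cluster are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Randomly assign each point to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(n_iter):
        # Compute each centroid as the mean of its assigned points.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded with a random point.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        # Reassign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so centroids are too
            break
        labels = new_labels
    return centroids, labels

# Two well-separated toy groups (made-up data for illustration).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Which group receives label 0 depends on the random initial assignment, so only the grouping itself is meaningful, not the label values.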

The following code will help in implementing K-means clustering algorithm in Python. We are going
to use the Scikit-learn module.

Let us import the necessary packages −

import matplotlib.pyplot as plt

import seaborn as sns; sns.set()

import numpy as np

from sklearn.cluster import KMeans

The following lines of code generate a two-dimensional dataset containing four blobs, using
make_blobs from the sklearn.datasets package.

from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples = 500, centers = 4,
   cluster_std = 0.40, random_state = 0)

We can visualize the dataset by using the following code −

plt.scatter(X[:, 0], X[:, 1], s = 50);

plt.show()

Here, we are initializing kmeans to be the KMeans algorithm, with the parameter n_clusters
specifying how many clusters to find.

kmeans = KMeans(n_clusters = 4)

We need to train the K-means model with the input data.

kmeans.fit(X)

y_kmeans = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c = y_kmeans, s = 50, cmap = 'viridis')

centers = kmeans.cluster_centers_

The code given below plots the cluster centers found by the model on top of the data, so that we
can see how well the clusters fit −

plt.scatter(centers[:, 0], centers[:, 1], c = 'black', s = 200, alpha = 0.5);

plt.show()

Mean Shift Algorithm


Mean shift is another popular and powerful clustering algorithm used in unsupervised learning. It
does not assume a particular number of clusters or a particular shape for them, hence it is a
non-parametric algorithm. It is also called mean shift cluster analysis. The basic steps of this
algorithm are as follows −

First of all, we start with each data point assigned to a cluster of its own.

Next, the algorithm computes the centroids and updates their locations.

By repeating this process, the centroids move closer to the peak of the cluster, i.e. towards the
region of higher density.

The algorithm stops at the stage where the centroids do not move anymore.
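The steps above can be sketched as follows. This is a minimal flat-kernel version written with the standard library only; the function name `mean_shift`, the `bandwidth` radius, the tolerance, and the rule for merging nearby centroids are assumptions made for the illustration, not the Scikit-learn implementation.

```python
import math

def mean_shift(points, bandwidth, n_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns the list of distinct modes found."""
    shifted = [tuple(p) for p in points]  # every point starts as its own centroid
    for _ in range(n_iter):
        moved = 0.0
        updated = []
        for c in shifted:
            # Mean of all original points within `bandwidth` of this centroid.
            window = [p for p in points if math.dist(p, c) <= bandwidth]
            new_c = tuple(sum(v) / len(window) for v in zip(*window))
            moved = max(moved, math.dist(c, new_c))
            updated.append(new_c)
        shifted = updated
        if moved < tol:  # centroids no longer move, so the search has converged
            break
    # Merge centroids that ended up on (almost) the same density peak.
    modes = []
    for c in shifted:
        if all(math.dist(c, m) > bandwidth / 2 for m in modes):
            modes.append(c)
    return modes

# Made-up toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
modes = mean_shift(points, bandwidth=1.0)
print(len(modes), modes)
```

The number of clusters is never passed in; it falls out of how many distinct peaks the centroids settle on, which in turn depends on the chosen bandwidth.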

With the help of the following code, we implement the Mean Shift clustering algorithm in Python. We
are going to use the Scikit-learn module.

Let us import the necessary packages −

import numpy as np

from sklearn.cluster import MeanShift

import matplotlib.pyplot as plt

from matplotlib import style

style.use("ggplot")

The following code generates a two-dimensional dataset containing three blobs, using make_blobs
from the sklearn.datasets package.

from sklearn.datasets import make_blobs

We can create and visualize the dataset with the following code −

centers = [[2,2],[4,5],[3,10]]

X, _ = make_blobs(n_samples = 500, centers = centers, cluster_std = 1)

plt.scatter(X[:,0],X[:,1])

plt.show()

Now, we need to train the Mean Shift cluster model with the input data.

ms = MeanShift()

ms.fit(X)

labels = ms.labels_

cluster_centers = ms.cluster_centers_

The following code prints the cluster centers and the estimated number of clusters for the input
data −

print(cluster_centers)

n_clusters_ = len(np.unique(labels))

print("Estimated clusters:", n_clusters_)

[[ 3.23005036 3.84771893]

[ 3.02057451 9.88928991]]

Estimated clusters: 2

The code given below plots the data points colored by their assigned cluster, together with the
cluster centers found by the model −

colors = 10*['r.','g.','b.','c.','k.','y.','m.']

for i in range(len(X)):

   plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(cluster_centers[:,0],cluster_centers[:,1],
   marker = "x",color = 'k', s = 150, linewidths = 5, zorder = 10)

plt.show()

Measuring the Clustering Performance


Real-world data is not naturally organized into a number of distinct clusters. For this
reason, it is not easy to visualize the clusters and draw inferences. That is why we need to measure
the clustering performance as well as its quality. This can be done with the help of silhouette analysis.

Silhouette Analysis
This method can be used to check the quality of clustering by measuring the distance between the
clusters. It provides a way to assess parameters such as the number of clusters by giving a
silhouette score. This score measures how close each point is to the points in its own cluster
compared with the points in the neighboring clusters.

Analysis of silhouette score


The score has a range of [-1, 1]. Following is the analysis of this score −

Score of +1 − A score near +1 indicates that the sample is far away from the neighboring
cluster.

Score of 0 − A score of 0 indicates that the sample is on or very close to the decision boundary
between two neighboring clusters.

Score of -1 − A negative score indicates that the sample may have been assigned to the wrong
cluster.


Calculating Silhouette Score


In this section, we will learn how to calculate the silhouette score.

Silhouette score can be calculated by using the following formula −

$$\text{silhouette score} = \frac{p - q}{\max(p, q)}$$

Here, 𝑝 is the mean distance to the points in the nearest cluster that the data point is not a part of.
And, 𝑞 is the mean intra-cluster distance to all the points in its own cluster.
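As a small worked example of the formula, the sketch below computes the score for a single point by hand. It uses only the standard library; the function name `silhouette` and the two toy clusters are made up for the illustration, and it assumes the point's own cluster contains at least one other point.

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette score of `point`, using p and q as defined above."""
    others_in_own = [x for x in own_cluster if x != point]
    # q: mean distance to the other points in the point's own cluster
    q = sum(math.dist(point, x) for x in others_in_own) / len(others_in_own)
    # p: mean distance to the points of the nearest cluster the point is not in
    p = min(
        sum(math.dist(point, x) for x in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (p - q) / max(p, q)

cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]
score = silhouette((0.0, 0.0), cluster_a, [cluster_b])
print(score)  # 0.75
```

Here q = 1 (the distance to the one other point in its cluster) and p = (3 + 5) / 2 = 4, so the score is (4 - 1) / 4 = 0.75, which indicates a well-placed point.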

To find the optimal number of clusters, we run the clustering algorithm once for each candidate
number of clusters and score each result with the metrics module from the sklearn package. In the
following example, we will run the K-means clustering algorithm to find the optimal number of clusters −

Import the necessary packages as shown −

import matplotlib.pyplot as plt

import seaborn as sns; sns.set()

import numpy as np

from sklearn.cluster import KMeans

from sklearn import metrics

With the help of the following code, we will generate a two-dimensional dataset containing four
blobs, using make_blobs from the sklearn.datasets package.

from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples = 500, centers = 4, cluster_std = 0.40, random_state = 0)

Initialize the variables as shown −

scores = []

values = np.arange(2, 10)

We need to iterate the K-means model through all the values and train it with the input data at
each step.

for num_clusters in values:

   kmeans = KMeans(init = 'k-means++', n_clusters = num_clusters, n_init = 10)

   kmeans.fit(X)

Now, still inside the loop, estimate the silhouette score for the current clustering model using
the Euclidean distance metric −

   score = metrics.silhouette_score(X, kmeans.labels_,
      metric = 'euclidean', sample_size = len(X))

The following lines, also inside the loop, display the number of clusters and the silhouette score,
and store the score.

   print("\nNumber of clusters =", num_clusters)

   print("Silhouette score =", score)

   scores.append(score)

You will receive the following output −


Number of clusters = 9

Silhouette score = 0.340391138371

num_clusters = np.argmax(scores) + values[0]

print('\nOptimal number of clusters =', num_clusters)

Now, the output for optimal number of clusters would be as follows −

Optimal number of clusters = 2
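The argmax step works because the scores are stored in the same order as the candidate values, so the position of the best score, shifted by the first candidate, gives the number of clusters. A small sketch with made-up scores (hypothetical numbers, not results from the dataset above) shows the mapping:

```python
values = list(range(2, 10))  # candidate cluster counts 2..9

# Hypothetical silhouette scores, one per candidate (not real results).
scores = [0.48, 0.59, 0.68, 0.57, 0.49, 0.41, 0.37, 0.34]

# Position of the highest score, then the candidate stored at that position.
best_index = max(range(len(scores)), key=scores.__getitem__)
best_k = values[best_index]  # equals best_index + values[0] for consecutive values

print(best_k)  # 4
```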

Finding Nearest Neighbors


If we want to build a recommender system, such as a movie recommender, we need to
understand the concept of finding the nearest neighbors, because recommender systems
rely on it.

Finding the nearest neighbors may be defined as the process of finding the points in a given
dataset that are closest to an input point. The main use of the KNN (K-nearest neighbors)
algorithm is to build classification systems that classify a data point based on its proximity
to the various classes.
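Before turning to the library, the idea can be illustrated with a brute-force search written with the standard library only. The helper name `k_nearest` is an assumption made for this sketch; it simply sorts every point by its Euclidean distance to the query.

```python
import math

def k_nearest(data, query, k):
    """Return the k points of `data` closest to `query` (Euclidean distance)."""
    return sorted(data, key=lambda point: math.dist(point, query))[:k]

data = [(3.1, 2.3), (2.3, 4.2), (3.9, 3.5), (3.7, 6.4), (4.8, 1.9),
        (8.3, 3.1), (5.2, 7.5), (4.8, 4.7), (3.5, 5.1), (4.4, 2.9)]

neighbors = k_nearest(data, (3.3, 2.9), k=3)
print(neighbors)  # [(3.1, 2.3), (3.9, 3.5), (4.4, 2.9)]
```

Sorting the whole dataset is fine for a handful of points; libraries typically use tree-based indexes to avoid comparing the query with every point.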

The Python code given below helps in finding the K-nearest neighbors of a given data set −

Import the necessary packages as shown below. Here, we are using the NearestNeighbors class
from the sklearn.neighbors package −

import numpy as np

import matplotlib.pyplot as plt

from sklearn.neighbors import NearestNeighbors

Let us now define the input data −

A = np.array([[3.1, 2.3], [2.3, 4.2], [3.9, 3.5], [3.7, 6.4], [4.8, 1.9],
   [8.3, 3.1], [5.2, 7.5], [4.8, 4.7], [3.5, 5.1], [4.4, 2.9],])

Now, we need to define the number of nearest neighbors −

k = 3

We also need to give the test data point whose nearest neighbors are to be found −

test_data = [3.3, 2.9]

The following code can visualize and plot the input data defined by us −

plt.figure()

plt.title('Input data')

plt.scatter(A[:,0], A[:,1], marker = 'o', s = 100, color = 'black')


Now, we need to build the K Nearest Neighbors model. The object also needs to be trained on the
input data −

knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(A)

distances, indices = knn_model.kneighbors([test_data])

Now, we can print the K nearest neighbors as follows −

print("\nK Nearest Neighbors:")

for rank, index in enumerate(indices[0][:k], start = 1):

   print(str(rank) + " is", A[index])

We can visualize the nearest neighbors along with the test data point −

plt.figure()

plt.title('Nearest neighbors')

plt.scatter(A[:, 0], A[:, 1], marker = 'o', s = 100, color = 'k')

plt.scatter(A[indices[0], 0], A[indices[0], 1],
   marker = 'o', s = 250, color = 'k', facecolors = 'none')

plt.scatter(test_data[0], test_data[1],
   marker = 'x', s = 100, color = 'k')

plt.show()


Output
K Nearest Neighbors

1 is [ 3.1 2.3]

2 is [ 3.9 3.5]

3 is [ 4.4 2.9]

K-Nearest Neighbors Classifier


A K-Nearest Neighbors (KNN) classifier is a classification model that uses the nearest neighbors
algorithm to classify a given data point. We implemented the nearest neighbors search in the last
section; now we are going to build a KNN classifier using that algorithm.

Concept of KNN Classifier


The basic concept of K-nearest neighbor classification is to find a predefined number, k, of
training samples closest in distance to a new sample that has to be classified. The new sample
gets its label from these neighbors. KNN classifiers use a fixed, user-defined constant
for the number of neighbors. For the distance, standard Euclidean
distance is the most common choice. The KNN classifier works directly on the stored training samples
rather than deriving rules from them. The KNN algorithm is among the simplest of all machine
learning algorithms. It has been quite successful in a large number of classification and regression
problems, for example, character recognition or image analysis.
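The voting idea described above can be sketched in a few lines of standard-library Python. The function name `knn_predict` and the toy training set are assumptions made for the illustration; the example that follows uses the Scikit-learn classifier instead.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, sample, k):
    """Label `sample` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], sample)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training data: two labeled groups.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_x, train_y, (1.1, 1.0), k=3))  # a
print(knn_predict(train_x, train_y, (3.9, 4.1), k=3))  # b
```

No model is fitted in advance: the training samples are stored as they are, and all the work happens when a new sample is classified.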

Example

We are building a KNN classifier to recognize digits. For this, we will use the digits dataset
bundled with scikit-learn, a small MNIST-like collection of 8×8 images. We will write this code in
the Jupyter Notebook.

Import the necessary packages as shown below.


Here we are using the KNeighborsClassifier class from the sklearn.neighbors package −

from sklearn.datasets import load_digits

import pandas as pd

%matplotlib inline

from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt

import numpy as np

import numpy as np

The following function displays the image of a digit so that we can verify which image we are testing −

def Image_display(i):

   plt.imshow(digit['images'][i],cmap = 'Greys_r')

   plt.show()

Now, we need to load the digits dataset. There are 1797 images in total; we use the first 1600
images as training samples and keep the remaining 197 for testing.

digit = load_digits()

digit_d = pd.DataFrame(digit['data'][0:1600])

Now, on displaying the images we can see the output as follows −

Image_display(0)

Image of 0 is displayed as follows −

Image_display(9)

Image of 9 is displayed as follows −


digit.keys()
Now, we need to create the training data set and fit the KNN classifier to it. The testing sample
is created afterwards.

train_x = digit['data'][:1600]

train_y = digit['target'][:1600]

KNN = KNeighborsClassifier(20)

KNN.fit(train_x,train_y)

The above code produces the following output, showing the parameters of the K nearest neighbor classifier −

KNeighborsClassifier(algorithm = 'auto', leaf_size = 30, metric = 'minkowski',

metric_params = None, n_jobs = 1, n_neighbors = 20, p = 2,

weights = 'uniform')

We need to create the testing sample by picking any index of 1600 or greater, since the first 1600
images were the training samples.

test = np.array(digit['data'][1725])

test1 = test.reshape(1,-1)

Image_display(1725)

The image, a 6, is displayed as follows −

Now we will predict the test data as follows −

KNN.predict(test1)


The above code will generate the following output −

array([6])

Now, consider the following −

digit['target_names']

The above code will generate the following output −

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

