Python Data Visualization Techniques

The document provides an overview of clustering algorithms, focusing on K-Means and Hierarchical Clustering, detailing their methodologies and applications. It explains the steps involved in K-Means clustering, including centroid selection and data point assignment, as well as the hierarchical approach to clustering that creates a dendrogram. Additionally, it touches on time-series analysis and text analysis techniques, including tokenization and TF-IDF for extracting insights from unstructured data.

Uploaded by

sumitdorle91

Outcome

Implement data visualization using visualization tools in Python programming.

Outline

Clustering Algorithms

Text Analysis

Model Evaluation and Selection

Clustering

What is Clustering | Cluster analysis

Cluster analysis is a statistical classification technique in which objects or points with similar characteristics are grouped together into clusters.

Need of Clustering Algorithms

● Trying to determine the appropriate audience for the product
● Using clustering algorithms on the customer base
● Selling the products to the targeted audience

Clustering Algorithms

❏ K-Means

❏ Hierarchical Clustering

❏ Time-series analysis

❏ K-Means
● K-Means is an unsupervised learning algorithm.

● It is used to solve clustering problems: it groups an unlabeled dataset into different clusters.

● Here K defines the number of predefined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.

● It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group, whose members have similar properties.

● It lets us cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

● It is a centroid-based algorithm, where each cluster is associated with a centroid.

● The main aim of this algorithm is to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters.
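The aim stated above can be written as a formula. The following is the usual within-cluster sum of squares; the symbols $J$, $C_k$, $\mu_k$ and $x_i$ are notation introduced here for illustration and do not appear in the slides.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Here $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is that cluster's centroid, the mean of its points.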

The k-means clustering algorithm mainly performs two tasks:

● Determines the best positions for the K center points (centroids) through an iterative process.

● Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.

● K-Means runs on distance calculations; it uses the Euclidean distance to measure the distance between two given instances.
● For given instances (X1, Y1) and (X2, Y2), the formula is $d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$.
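As a small illustration, the Euclidean distance between two 2-D instances can be computed as below. The function name `euclidean_distance` is chosen here for the example and is not from the slides.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points given as (x, y) pairs."""
    (x1, y1), (x2, y2) = p, q
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


# The points (1, 2) and (4, 6) differ by 3 and 4, a 3-4-5 right triangle.
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```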

How does the K-Means Algorithm Work?

Step 1: Select the number K to decide the number of clusters. For example, take K=2: we will try to group the data points into two different clusters.

Step 2: Select K random points as the initial centroids. (They need not be points from the input dataset.)

Step 3: Assign each data point to its closest centroid, which forms the predefined K clusters.

Step 4: Calculate the mean of the points in each cluster and place the cluster's new centroid there. This is the position that minimizes the variance within the cluster.

Step 5: Repeat Step 3, that is, reassign each data point to the new closest centroid.

Step 6: If any reassignment occurs, go to Step 4; otherwise go to FINISH.

Step 7: The model is ready.
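The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the sample points are all assumptions made for the example, and the initial centroids are drawn from the input points, which is only one of several possible choices.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick K initial centroids (here: K distinct input points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels


# Hypothetical sample data: two visibly separated groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

With K=2, the first three points should end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initialization.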

❏ Hierarchical Clustering | hierarchical cluster analysis

● Hierarchical clustering, also called hierarchical cluster analysis (HCA), is an unsupervised machine learning algorithm.

● It is used to group unlabeled data points into clusters.

● We develop the hierarchy of clusters in the form of a tree.

● This tree-shaped structure is known as the dendrogram.

Why hierarchical clustering?

● We can opt for the hierarchical clustering algorithm because it does not require us to know the number of clusters in advance.

● The hierarchical clustering technique has two approaches:

1. Agglomerative: a bottom-up approach, in which the algorithm starts by treating every data point as a single cluster and keeps merging clusters until only one cluster is left.

2. Divisive: the reverse of the agglomerative algorithm; it is a top-down approach.

Agglomerative Hierarchical clustering

● The agglomerative hierarchical clustering algorithm is a popular example of HCA (hierarchical cluster analysis).

● To group the data points into clusters, it follows the bottom-up approach.

● That is, the algorithm treats each data point as a single cluster at the beginning and then starts combining the closest pair of clusters.

● It does this until all the clusters are merged into a single cluster that contains all the data points.
How does agglomerative hierarchical clustering work?

Step 1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.

Step 2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.

Step 3: Again, take the two closest clusters and merge them to form one cluster. There will be N-2 clusters.

Step 4: Repeat Step 3 until only one cluster is left.

Step 5: Once all the clusters are combined into one big cluster, use the dendrogram to divide the clusters as the problem requires.
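The merging procedure can be sketched in plain Python as follows. This is an illustrative sketch only: the function name `agglomerative`, the sample points, and the choice of the closest-member distance between clusters are assumptions made for the example. The recorded merge history (which clusters merged, and at what distance) is the information a dendrogram draws.

```python
import math


def agglomerative(points):
    """Merge the two closest clusters until one is left; return the merge history."""
    # Step 1: every data point starts as its own cluster (N clusters).
    clusters = [[p] for p in points]
    history = []

    def closest_member_distance(a, b):
        # Distance between two clusters = distance between their closest members.
        return min(math.dist(p, q) for p in a for q in b)

    # Steps 2-4: repeatedly merge the closest pair of clusters.
    while len(clusters) > 1:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: closest_member_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        distance = closest_member_distance(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        # j > i, so deleting index j first leaves index i valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return history


# Hypothetical sample data.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
history = agglomerative(points)
for left, right, distance in history:
    print(left, "+", right, "at distance", round(distance, 2))
```

Five points produce four merges. Cutting the history at a chosen distance (for example, keeping only merges below 2) corresponds to cutting the dendrogram at that height.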

Measure for the distance between two clusters

● The distance between clusters is crucial to hierarchical clustering, because the two closest clusters are the ones merged next.

● There are various ways to calculate the distance between two clusters, and the chosen way decides the rule for clustering.

● These measures are called linkage methods.

Linkage Methods

● Single linkage: the shortest distance between the closest points of the two clusters.

● Complete linkage: the farthest distance between two points of two different clusters. It is one of the popular linkage methods because it forms tighter clusters than single linkage.

● Average linkage: the distances between every pair of points, one from each cluster, are added up and divided by the number of pairs to give the average distance between the two clusters.

● Centroid linkage: the distance between the centroids of the two clusters.
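The four linkage methods can be compared on a small example. The sketch below uses only the Python standard library; the function names and the two sample clusters `a` and `b` are assumptions made for illustration.

```python
import math
from itertools import product


def single_linkage(a, b):
    # Shortest distance between any point of a and any point of b.
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    # Farthest distance between any point of a and any point of b.
    return max(math.dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    # Mean of the distances over all cross-cluster pairs.
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    # Coordinate-wise mean of the points in the cluster.
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))


def centroid_linkage(a, b):
    # Distance between the two cluster centroids.
    return math.dist(centroid(a), centroid(b))


# Hypothetical clusters: two vertical pairs of points, 3 units apart.
a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
for name, fn in [
    ("single", single_linkage),
    ("complete", complete_linkage),
    ("average", average_linkage),
    ("centroid", centroid_linkage),
]:
    print(name, round(fn(a, b), 4))
```

For any two clusters, the average linkage lies between the single and complete linkage values, since it is a mean of distances bounded by those two.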

Divisive Hierarchical clustering

● This top-down strategy does the reverse of agglomerative hierarchical clustering: it starts with all objects in one cluster.

● It subdivides the cluster into smaller and smaller pieces until each object forms a cluster on its own or until a termination condition is satisfied, such as reaching a desired number of clusters or the diameter of each cluster falling within a certain threshold.

Agglomerative Hierarchical clustering

● Initially each item is in its own cluster
● Clusters are iteratively merged together
● Bottom up

Divisive Hierarchical clustering

● Initially all items are in one cluster
● Large clusters are successively divided
● Top down

❏ Time-series analysis

● A time series is a sequence of data points in chronological order, most often gathered at regular intervals.

● Time-series analysis can be applied to any variable that changes over time. Generally, data points that are closer together in time are more similar than those further apart.

● It is a way of studying the characteristics of the response variable with respect to time as the independent variable.

● To estimate the target variable, for prediction or forecasting, the time variable is used as the point of reference.

Example: stock price

Basic structure of time series data. Observations are recorded every hour.

Timestamp            Stock Price
2015-10-11 09:00:00  100
2015-10-11 10:00:00  110
2015-10-11 11:00:00  105
2015-10-11 12:00:00  90
2015-10-11 13:00:00  120
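As a rough sketch, the stock-price table above can be loaded as a time series in Python, with time as the reference variable. The variable names are chosen here for illustration; the check for regular spacing and the hour-over-hour change are simple examples of what treating time as the independent variable allows.

```python
from datetime import datetime

# The hourly observations from the example table.
rows = [
    ("2015-10-11 09:00:00", 100),
    ("2015-10-11 10:00:00", 110),
    ("2015-10-11 11:00:00", 105),
    ("2015-10-11 12:00:00", 90),
    ("2015-10-11 13:00:00", 120),
]

# Parse the timestamps and keep the observations in chronological order.
series = [(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), price) for ts, price in rows]
series.sort(key=lambda item: item[0])

# Gaps between consecutive observations, in seconds; one value means regular spacing.
gaps = {(b[0] - a[0]).total_seconds() for a, b in zip(series, series[1:])}

# Change in price from each observation to the next.
changes = [b[1] - a[1] for a, b in zip(series, series[1:])]

print(gaps)     # {3600.0}
print(changes)  # [10, -5, -15, 30]
```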

Components of time series

● Trend
● Seasonality
● Cyclical
● Irregularity