0% found this document useful (0 votes)

12 views10 pages

Understanding Clustering in Data Mining

Clustering in data mining is the process of grouping similar objects into clusters to understand data distribution and structure. It involves various methods such as partitioning, hierarchical, density-based, grid-based, model-based, and constraint-based methods, each with its own advantages and challenges. Key requirements for effective clustering include scalability, ability to handle different data types, and interpretability of results.

Uploaded by

gmeghanareddy27

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views10 pages

Understanding Clustering in Data Mining

Uploaded by

gmeghanareddy27

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

What is clustering in Data Mining?

o Clustering is the method of converting a group of abstract objects into classes of similar
objects.

o Clustering is a method of partitioning a set of data or objects into a set of significant

subclasses called clusters.

o It helps users to understand the structure or natural grouping in a data set and used either
as a stand-alone instrument to get a better insight into data distribution or as a pre-processing
step for other algorithms

What is a Cluster?

o A cluster is a subset of similar objects

o A subset of objects such that the distance between any of the two objects in the cluster is
less than the distance between any object in the cluster and any object that is not located
inside it.

o A connected region of a multidimensional space with a comparatively high density of objects

Requirements of Clustering in Data Mining

Clustering analysis has been an evolving problem in data mining due to its variety of
applications. The advent of various data clustering tools in the last few years and their
comprehensive use in a broad range of applications, including image processing,
computational biology, mobile communication, medicine, and economics, must contribute to
the popularity of these algorithms. The main issue with the data clustering algorithms is that it
cannot be standardized. The advanced algorithm may give the best results with one type of
data set, but it may fail or perform poorly with other kinds of data set. Although many efforts
have been made to standardize the algorithms that can perform well in all situations, no
significant achievement has been achieved so far. Many clustering tools have been proposed
so far. However, each algorithm has its advantages or disadvantages and cant work on all real
situations.

1. Scalability: Scalability in clustering implies that as we boost the amount of data objects,
the time to perform clustering should approximately scale to the complexity order of
the algorithm. For example, if we perform K- means clustering, we know it is O(n),
where n is the number of objects in the data. If we raise the number of data objects 10
folds, then the time taken to cluster them should also approximately increase 10 times.
It means there should be a linear relationship. If that is not the case, then there is some
error with our implementation process.
2. Ability to deal with different kinds of attributes − Algorithms should be capable to be
applied on any kind of data such as interval-based (numerical) data, categorical, and
binary data.
3. Discovery of clusters with attribute shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. They should not be bounded to only
distance measures that tend to find spherical cluster of small sizes.
4. High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
5. Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
6. Interpretability − The clustering results should be interpretable, comprehensible

Clustering Methods
Clustering methods can be classified into the following categories –
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partition of data. Each partition will represent a cluster and k ≤ n. It means that it will
classify the data into k groups, which satisfy the following requirements − Each group
contains at least one object. Each object must belong to exactly one group. Points to
remember − For a given number of partitions (say k), the partitioning method will create
an initial partitioning. Then it uses the iterative relocation technique to improve the
partitioning by moving objects from one group to other.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition
is formed. There are two approaches here –
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach This approach is also known as the bottom-up approach.
In this, we start with each object forming a separate group. It keeps on merging the
objects or groups that are close to one another. It keep on doing so until all of the
groups are merged into one or until the termination condition holds.
Divisive Approach This approach is also known as the top-down approach. In this,
we start with all of the objects in the same cluster. In the continuous iteration, a cluster
is split up into smaller clusters. It is down until each object in one cluster or the
termination condition holds. This method is rigid, i.e., once a merging or splitting is
done, it can never be undone. Approaches to Improve Quality of Hierarchical
Clustering Here are the two approaches that are used to improve the quality of
hierarchical clustering Perform careful analysis of object linkages at each hierarchical
partitioning. Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then performing
macro-clustering on the micro-clusters.
Density-based Method
• Most partitioning methods cluster objects based on the distance between
objects. Such methods can find only spherical-shaped clusters and encounter
difficulty at discovering clusters of arbitrary shapes.
• Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the
density in the neighborhood exceeds some threshold; that is, for each data
point within a given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points. Such a method can be used to filter out
noise (outliers)and discover clusters of arbitrary shape.
• DBSCAN and its extension, OPTICS, are typical density-based methods that
growclusters according to a density-based connectivity analysis. DENCLUE is
a methodthat clusters objects based on the analysis of the value distributions
of density functions
Grid-based Method
• Grid-based methods quantize the object space into a finite number of cells that
form a grid structure.
• All of the clustering operations are performed on the grid structure i.e., on the
quantized space. The main advantage of this approach is its fast processing
time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized
space.
• STING is a typical example of a grid-based method. Wave Cluster applies
wavelet transformation for clustering analysis and is both grid-based and
density-based.
Model-based methods
• Model-based methods hypothesize a model for each of the clusters and find
the best fit of the data to the given model.
• A model-based algorithm may locate clusters by constructing a density function
that reflects the spatial distribution of the data points.
• It also leads to a way of automatically determining the number of clusters based
on standard statistics, taking ―noise‖ or outliers into account and thus yielding
robust clustering methods.
Constraint-based Method
Model-based methods hypothesize a model for each of the clusters and find the best
fit of the data to the given model.
A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking ―noise‖ or outliers into account and thus yielding robust
clustering methods.

.
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of square errors
Typical methods:
1.k-means Clustering
2.k-medoids Clustering
1. K-Means Clustering Algorithm

K-Means clustering is an unsupervised iterative clustering technique.

• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain similarities

It partitions the data set such that-

• Each data point belongs to a cluster with the nearest mean.
• Data points belonging to one cluster have high degree of similarity.
• Data points belonging to different clusters have high degree of dissimilarity.
K-Means Clustering Algorithm
K-Means Clustering Algorithm involves the following steps
Step-01: • Choose the number of clusters K.
Step-02: • Randomly select any K data points as cluster centres.
• Select cluster centres in such a way that they are as farther as possible from each
other.
Step-03:
Calculate the distance between each data point and each cluster centre.
• The distance may be calculated either by using given distance function or by using
Euclidean distance formula.
Step-04: • Assign each data point to some cluster.
• A data point is assigned to that cluster whose centre is nearest to that data point.
Step-05: • Re-compute the centre of newly formed clusters.
• The centre of a cluster is computed by taking mean of all the data points contained
in that cluster.
Step-06: Keep repeating the procedure from Step-03 to Step-05 until any of the
following stopping criteria is met-
• Centre of newly formed clusters do not change
• Data points remain present in the same cluster

Maximum number of iterations are reached

Advantages

The following are some advantages of K-Means clustering algorithms −

• It is very easy to understand and implement.

• If we have large number of variables then, K-means would be faster than Hierarchical
clustering.
• On re-computation of centroids, an instance can change the cluster.
• Tighter clusters are formed with K-means as compared to Hierarchical clustering.

Disadvantages

The following are some disadvantages of K-Means clustering algorithms −

• It is a bit difficult to predict the number of clusters i.e. the value of k.
• Output is strongly impacted by initial inputs like number of clusters (value of k).
• Order of data will have strong impact on the final output.
• It is very sensitive to rescaling. If we will rescale our data by means of normalization
or standardization, then the output will completely change. Final output.
• It is not good in doing clustering job if the clusters have a complicated geometric shape.

K-Means Additional Issues

There are various issues of the K-Means Algorithm which are as follows −

• Handling Empty Clusters − The first issue with the basic K-means algorithm given
prior is that null clusters can be acquired if no points are allocated to a cluster during
the assignment phase. If this occurs, then a method is needed to choose a
replacement centroid, because the squared error will be larger than necessary.
• One method is to select the point that is farthest away from some recent centroid. If
this removes the point that currently contributes some total squared error. Another
method is to select the replacement centroid from the cluster that has the largest SSE.
This will generally divide the cluster and decrease the complete SSE of the clustering.
If there are multiple null clusters, then this process can be repeated multiple times.
• Outliers − When the squared error method is used, outliers can unduly tend to the
clusters that are discovered. In specific, when outliers are present, the resulting cluster
centroids (prototypes) cannot be as representative as they can be, and thus, the SSE
will be higher as well.
• It is beneficial to find outliers and remove them beforehand. It is essential to appreciate
that there are specific clustering applications for which outliers should not be removed.
When clustering is used for data compression, each point should be clustered, and in
some cases, including financial analysis, probable outliers, e.g.,unusually profitable
users, can be the interesting points.
• Reducing the SSE with Postprocessing − The method to reduce the SSE is to find
more clusters, i.e., to need a larger K. In such cases, it is likely to improve the SSE,
but don't require to increase the number of clusters. This is possible because Kmeans
generally converge to a local minimum.
• Various methods are used to "fix-up" the resulting clusters to make a clustering that
has lower SSE. The method is to target on individual clusters because the complete
SSE is easily the total of the SSE contributed by every cluster. It can change the total
SSE by implementing several operations on the clusters, including splitting or merging
clusters.
• One method is to use an alternate cluster splitting and merging procedure. During a
splitting procedure, clusters are divided, while during a merging procedure, clusters
are combined. In this method, it is accessible to withdrawal local SSE minima and
create a clustering solution with the seized number of clusters. The following are some
methods used in the splitting and merging phases which are as follows −

Bisecting k-means Algorithm

Bisecting k-means is a hybrid approach between Divisive Hierarchical Clustering (top
down clustering) and K-means Clustering. Instead of partitioning the data set into K
clusters in each iteration, bisecting k-means algorithm splits one cluster into two sub
clusters at each bisecting step (by using k-means) until k clusters are obtained.
Divisive Hierarchical Clustering

Divisive Clustering starts at the top level with a single cluster and divides it towards the
bottom level. In order to decide which clusters should be split, a measure of
dissimilarity between sets of observations is required. This is achieved by using a
measure of distance between pairs of observations.
How Bisecting K-means Work
1. Set K to define the number of cluster
2. Set all data as a single cluster
3. Use K-means with K=2 to split the cluster

4. Measure the distance for each intra cluster.

- Sum of square Distance

5. Select the cluster that have the largest distance and split to 2 cluster using K-means.
6. Repeat step 3–5 until the number of leaf cluster = K.

Final cluster is cluster [C,D,E,F]

Advantage of Using B-Kmeans over Kmeans

1. Bisecting k-means is more efficient when K is large.
For the kmeans algorithm, the computation involves every data point of the data set
and k centroids. On the other hand, in each Bisecting step of Bisecting k-means, only
the data points of one cluster and two centroids are involved in the computation. Thus,
the computation time is reduced.
2. Bisecting k-means produce clusters of similar sizes, while k-means is known to
produce clusters of widely different sizes.

Cluster Analysis in Machine Learning
No ratings yet
Cluster Analysis in Machine Learning
15 pages
Understanding Cluster Analysis Methods
No ratings yet
Understanding Cluster Analysis Methods
6 pages
Clustering Introducion
No ratings yet
Clustering Introducion
6 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
14 pages
Data Classification Types in Mining
No ratings yet
Data Classification Types in Mining
27 pages
Clustering and Outlier Detection Methods
No ratings yet
Clustering and Outlier Detection Methods
21 pages
Cluster Analysis Methods Explained
No ratings yet
Cluster Analysis Methods Explained
31 pages
Cluster Analysis: Methods & Applications
No ratings yet
Cluster Analysis: Methods & Applications
17 pages
Understanding Clustering Methods in Data Analysis
No ratings yet
Understanding Clustering Methods in Data Analysis
18 pages
Understanding Clustering in Machine Learning
No ratings yet
Understanding Clustering in Machine Learning
9 pages
DM Cluster Analysis
No ratings yet
DM Cluster Analysis
3 pages
Cluster Analysis in Data Mining
No ratings yet
Cluster Analysis in Data Mining
15 pages
Cluster Analysis in Data Mining
No ratings yet
Cluster Analysis in Data Mining
52 pages
Understanding Clustering in Data Mining
No ratings yet
Understanding Clustering in Data Mining
19 pages
Understanding Clustering Techniques
No ratings yet
Understanding Clustering Techniques
69 pages
Understanding Cluster Analysis Techniques
No ratings yet
Understanding Cluster Analysis Techniques
18 pages
Understanding Clustering in Data Mining
No ratings yet
Understanding Clustering in Data Mining
12 pages
Cluster Analysis in Data Mining Techniques
No ratings yet
Cluster Analysis in Data Mining Techniques
21 pages
Understanding Cluster Analysis in Data Mining
No ratings yet
Understanding Cluster Analysis in Data Mining
26 pages
Clustering Techniques and Applications
No ratings yet
Clustering Techniques and Applications
27 pages
CLARA vs. CLARANS in Clustering
No ratings yet
CLARA vs. CLARANS in Clustering
26 pages
Understanding Cluster Detection Methods
No ratings yet
Understanding Cluster Detection Methods
18 pages
UNIT
No ratings yet
UNIT
47 pages
Clustering Methods
No ratings yet
Clustering Methods
7 pages
Clustering Methods and Applications Guide
No ratings yet
Clustering Methods and Applications Guide
16 pages
Module 5
No ratings yet
Module 5
26 pages
Cluster Analysis Basics Explained
No ratings yet
Cluster Analysis Basics Explained
29 pages
Understanding Clustering in Data Analysis
No ratings yet
Understanding Clustering in Data Analysis
6 pages
Clustering Methods in Data Mining Survey
No ratings yet
Clustering Methods in Data Mining Survey
5 pages
Understanding K-Means Clustering
No ratings yet
Understanding K-Means Clustering
22 pages
Cluster Analysis
No ratings yet
Cluster Analysis
23 pages
DMML Unit5
No ratings yet
DMML Unit5
222 pages
Classification vs Clustering Explained
No ratings yet
Classification vs Clustering Explained
31 pages
Clustering Techniques in Machine Learning
No ratings yet
Clustering Techniques in Machine Learning
91 pages
Comparative Study of Clustering Methods
No ratings yet
Comparative Study of Clustering Methods
7 pages
Unit 5 Materials
No ratings yet
Unit 5 Materials
65 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
31 pages
ML Unit 4
No ratings yet
ML Unit 4
48 pages
Cluster Analysis: Concepts & Methods
No ratings yet
Cluster Analysis: Concepts & Methods
14 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
11 pages
Understanding Clustering Algorithms
No ratings yet
Understanding Clustering Algorithms
40 pages
Data Mining Clustering Techniques Explained
No ratings yet
Data Mining Clustering Techniques Explained
3 pages
Cluster Analysis: Methods and Applications
No ratings yet
Cluster Analysis: Methods and Applications
7 pages
Cluster Analysis in Machine Learning
No ratings yet
Cluster Analysis in Machine Learning
17 pages
Introduction to Cluster Analysis Methods
No ratings yet
Introduction to Cluster Analysis Methods
57 pages
Cluster Analysis: Methods and Applications
No ratings yet
Cluster Analysis: Methods and Applications
53 pages
Practical Software Testing
No ratings yet
Practical Software Testing
3 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
21 pages
Clustering Techniques in Business Analytics
No ratings yet
Clustering Techniques in Business Analytics
8 pages
Clustering Algorithms
No ratings yet
Clustering Algorithms
10 pages
Cluster Analysis in Web Data Mining
No ratings yet
Cluster Analysis in Web Data Mining
17 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
26 pages
Cluster Analysis Methods in Data Mining
No ratings yet
Cluster Analysis Methods in Data Mining
41 pages
Cluster Analysis Methods and Concepts
No ratings yet
Cluster Analysis Methods and Concepts
106 pages
Clustering Methods in Data Mining
No ratings yet
Clustering Methods in Data Mining
15 pages
Module 4
No ratings yet
Module 4
21 pages
Clustering Analysis
No ratings yet
Clustering Analysis
7 pages
Understanding Cluster Analysis Methods
No ratings yet
Understanding Cluster Analysis Methods
4 pages
Understanding Clustering Techniques
No ratings yet
Understanding Clustering Techniques
11 pages
Algebra 2 Formula Reference Sheet
100% (1)
Algebra 2 Formula Reference Sheet
1 page
Step Radius R Specifications and Pricing
No ratings yet
Step Radius R Specifications and Pricing
1 page
Mean Volume Analysis of Perfume Bottles
100% (1)
Mean Volume Analysis of Perfume Bottles
10 pages
Understanding P and NP Complexity Classes
No ratings yet
Understanding P and NP Complexity Classes
9 pages
International GCSE Maths Mark Scheme 2019
No ratings yet
International GCSE Maths Mark Scheme 2019
22 pages
Year 4 Place Value in 4-Digit Numbers
No ratings yet
Year 4 Place Value in 4-Digit Numbers
36 pages
First Degree Function Exercises and Problems
No ratings yet
First Degree Function Exercises and Problems
4 pages
High-Dimensional Data Analysis Techniques
No ratings yet
High-Dimensional Data Analysis Techniques
322 pages
Evaluating Worldview Test Consistency
No ratings yet
Evaluating Worldview Test Consistency
46 pages
Leibniz's Contributions to Calculus
No ratings yet
Leibniz's Contributions to Calculus
5 pages
Co-Channel vs. Adjacent Channel Interference
No ratings yet
Co-Channel vs. Adjacent Channel Interference
6 pages
AGC Optimization in Power Systems
No ratings yet
AGC Optimization in Power Systems
15 pages
How To Use Abaqus CEL To Model Air Pressure
No ratings yet
How To Use Abaqus CEL To Model Air Pressure
9 pages
The Role of Mathematics in Daily Life
No ratings yet
The Role of Mathematics in Daily Life
2 pages
B.A./B.Sc. Mathematics Exam Paper
No ratings yet
B.A./B.Sc. Mathematics Exam Paper
2 pages
Solving Linear Equations and Graphs
No ratings yet
Solving Linear Equations and Graphs
12 pages
Edited - 6.3 Practice Sheet
No ratings yet
Edited - 6.3 Practice Sheet
4 pages
Supervised Learning Techniques Overview
No ratings yet
Supervised Learning Techniques Overview
166 pages
Moment Distribution Method Explained
No ratings yet
Moment Distribution Method Explained
19 pages
GCF Calculation and River Crossing Puzzle
No ratings yet
GCF Calculation and River Crossing Puzzle
2 pages
Abstract Algebra: Group Exercises
No ratings yet
Abstract Algebra: Group Exercises
207 pages
Sports Losses Impact Stock Returns
No ratings yet
Sports Losses Impact Stock Returns
33 pages
2020-2021 Exam Schedule for IX-XII
No ratings yet
2020-2021 Exam Schedule for IX-XII
1 page
Matrix Concepts for JEE Preparation
No ratings yet
Matrix Concepts for JEE Preparation
144 pages
Advantages of Pressuremeter Testing
No ratings yet
Advantages of Pressuremeter Testing
8 pages
Patterns and Mathematics in Nature
No ratings yet
Patterns and Mathematics in Nature
42 pages
Geometry Area Problems and Solutions
No ratings yet
Geometry Area Problems and Solutions
5 pages
Class 9 Mathematics: Number Systems Guide
No ratings yet
Class 9 Mathematics: Number Systems Guide
144 pages
Data Structures and Python Basics
No ratings yet
Data Structures and Python Basics
31 pages
Integral Action in Feedback Control
No ratings yet
Integral Action in Feedback Control
15 pages

Understanding Clustering in Data Mining

Uploaded by

Understanding Clustering in Data Mining

Uploaded by

What is clustering in Data Mining?

o Clustering is a method of partitioning a set of data or objects into a set of significant

o A cluster is a subset of similar objects

o A connected region of a multidimensional space with a comparatively high density of objects

Requirements of Clustering in Data Mining

K-Means clustering is an unsupervised iterative clustering technique.

It partitions the data set such that-

Maximum number of iterations are reached

The following are some advantages of K-Means clustering algorithms −

• It is very easy to understand and implement.

The following are some disadvantages of K-Means clustering algorithms −

K-Means Additional Issues

Bisecting k-means Algorithm

4. Measure the distance for each intra cluster.

Final cluster is cluster [C,D,E,F]

Advantage of Using B-Kmeans over Kmeans

You might also like