DATA MINING
PRESENTATION ON
CLUSTERING IN DATA MINING
AN OVERVIEW OF TECHNIQUES AND APPLICATIONS
BY
HARSH ROY (22/13071)
DEFINITION AND SCOPE
Definition: Clustering is the process of
grouping a set of objects such that objects in
the same group (cluster) are more similar to
each other than to those in other groups.
Importance: Helps uncover hidden patterns in
data, often used for exploratory data analysis.
Use Cases:
Market segmentation.
Document clustering for information
retrieval.
Image and video processing.
TYPES OF CLUSTERING
Partitioning Methods:
Example: K-means.
Divides data into non-overlapping subsets.
Hierarchical Methods:
Example: Agglomerative clustering.
Creates a tree-like structure (dendrogram).
Density-Based Methods:
Example: DBSCAN.
Groups data based on density of points.
K-MEANS CLUSTERING
Algorithm Overview:
a. Initialize K cluster centroids randomly.
b. Assign each data point to the nearest
centroid.
c. Recalculate centroids as the mean of
assigned points.
d. Repeat until centroids stabilize.
Strengths:
Simple and fast for large datasets.
Weaknesses:
Sensitive to the choice of K, to the initial centroid positions, and to outliers.
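Steps a to d above can be sketched in a few lines of plain Python. This is a minimal illustration and not a production implementation: the function names `k_means` and `squared_distance`, the `max_iters` cap, and the fixed `seed` are assumptions made for the example, and the initial centroids are drawn from the data points themselves, which is only one of several common initialization choices.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups."""
    rng = random.Random(seed)
    # Step a: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Step b: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step c: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step d: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
```

On this toy input the first three points and the last three points end up in separate clusters. Which cluster receives index 0 depends on the random initialization.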
HIERARCHICAL CLUSTERING
Types:
Agglomerative: Start with individual points, merge clusters.
Divisive: Start with one cluster, split iteratively.
Representation:
Dendrogram shows hierarchical relationships.
Strengths:
No need to specify the number of clusters in advance.
Weaknesses:
Computationally expensive for large datasets: standard agglomerative algorithms need at least quadratic time in the number of points.
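The agglomerative (bottom-up) variant can be sketched as follows. The slides do not name a linkage criterion, so single linkage (distance between the two closest members) is assumed here. The function name `single_linkage` and its `num_clusters` parameter are hypothetical. The returned merge history is the information a dendrogram draws, and stopping at `num_clusters` corresponds to cutting the dendrogram at that level.

```python
import math


def single_linkage(points, num_clusters=1):
    """Naive agglomerative clustering with single linkage.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, distance) tuples.
    """
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = single_linkage(points, num_clusters=2)
```

The nested loops recompute every pairwise linkage on each merge, which makes the cost of the method on large datasets easy to see. Calling the function with the default `num_clusters=1` yields the full merge history, so the number of clusters can be chosen afterwards.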
DENSITY-BASED CLUSTERING
DBSCAN Overview:
Groups points that are closely packed together based on
a specified density threshold.
Points in low-density regions are labeled as noise.
Applications:
Identifying irregularly shaped clusters.
Outlier detection.
Challenges:
Requires setting appropriate parameters: the neighborhood radius ε and the
minimum number of points (MinPts) needed to form a dense region.
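A compact sketch of the DBSCAN procedure shows how ε and MinPts interact. The names `dbscan`, `eps`, `min_pts`, and the `NOISE` marker are chosen for this example, Euclidean distance is assumed, and the neighbor search is a brute-force scan rather than the spatial index a real implementation would use.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            # Not a core point; may later be claimed as a border point.
            labels[i] = NOISE
            continue
        # Point i is a core point: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                # Border point: reachable from a core point, not core itself.
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                # Point j is also a core point, so keep growing the cluster.
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these settings the two tight squares form two clusters and the isolated point at (50, 50) is labeled as noise. Shrinking `eps` below 1 or raising `min_pts` above 4 would turn every point into noise, which illustrates why parameter choice is the main challenge.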
CLUSTER EVALUATION METRICS
Internal Metrics:
Silhouette Score: Measures how similar an object is to its
own cluster vs. others.
Cohesion: Measures intra-cluster similarity.
External Metrics:
Purity: Fraction of data points that belong to the majority true class of
their assigned cluster.
Normalized Mutual Information (NMI): Measures shared
information between true and predicted clusters.
Challenges:
No universal metric; depends on the data and goals.
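The metrics above can be computed directly from their definitions. The sketch below is illustrative only: the function names are hypothetical, Euclidean distance is assumed for the silhouette score, singleton clusters are given a silhouette of 0 by convention, and NMI is normalized by the arithmetic mean of the two entropies, which is one of several normalizations in use. The silhouette function expects at least two clusters.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def purity(true_labels, predicted_labels):
    """Fraction of points that match the majority true class of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, predicted_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def normalized_mutual_info(true_labels, predicted_labels):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, predicted_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    mutual_info = sum(
        (n_tc / n) * math.log(n * n_tc / (true_counts[t] * pred_counts[c]))
        for (t, c), n_tc in joint.items()
    )

    def entropy(counts):
        return -sum((v / n) * math.log(v / n) for v in counts.values())

    total_entropy = entropy(true_counts) + entropy(pred_counts)
    if total_entropy == 0:
        return 1.0  # both labelings put everything in a single group
    return 2 * mutual_info / total_entropy


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Here `good` is close to 1 because each point is much nearer to its own cluster than to the other one, while `bad` is negative because the labeling pairs each point with a distant one.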
STRENGTHS AND WEAKNESSES OF CLUSTERING METHODS
K-Means:
Strength: Efficient and simple.
Weakness: Assumes spherical clusters.
Hierarchical:
Strength: Captures nested clusters.
Weakness: Computationally intensive.
DBSCAN:
Strength: Detects clusters of arbitrary
shape.
Weakness: Struggles with varying
densities.
APPLICATIONS OF CLUSTERING
Real-World Applications:
Customer Segmentation:
Group customers based on purchasing behavior.
Image Segmentation:
Cluster pixels for object identification in images.
Climate Data Analysis:
Identify regions with similar weather patterns.
Fraud Detection:
Flag transactions that fall outside the clusters of normal behavior.
CONCLUSION
Summary:
Clustering is a versatile tool in data mining.
Various algorithms cater to different data characteristics.
Future Trends:
Development of more scalable and adaptive clustering methods.
Integration with deep learning techniques.
Closing Statement:
"Clustering continues to play a pivotal role in making sense of
complex datasets."
THANK YOU!
harshroydsc71@[Link]