Data Mining
Asfiya Abidi
Assistant Professor
Dr. Akhilesh Das Gupta Institute of Professional Studies (ADGIPS)
Unit – 4 Cluster Analysis
1. Definition
Cluster Analysis (also known as Clustering) is a data mining technique used to group a
set of objects into clusters (or groups) so that:
Objects within the same cluster are similar to each other, and
Objects in different clusters are dissimilar.
In short, cluster analysis is the process of identifying natural groupings or patterns in data
based on similarities or distances among data points.
2. Objective
The main goal of clustering is to discover structure and relationships in data without
using predefined labels. It is a form of unsupervised learning because the algorithm
learns the grouping from the data itself.
Types of Clusters in Data Mining
Type of Cluster | Key Concept | Example Algorithms | Shape / Structure | Applications
Well-Separated | Points within a cluster are far from other clusters | K-Means, Hierarchical | Spherical, distinct | Clear category groups
Center-Based | Belongs to cluster with nearest center | K-Means, K-Medoids | Spherical | Market segmentation
Contiguity-Based | Based on connectivity or proximity | Hierarchical | Chain or tree-like | Spatial data
Density-Based | Dense areas form clusters | DBSCAN, OPTICS | Arbitrary shape | Spatial / image data
Shared Property (Conceptual) | Common attribute or concept | COBWEB | Attribute-based | Text, documents
Graph-Based | Connected nodes form clusters | MST, Spectral Clustering | Network-based | Social networks
Hierarchical Methods of Clustering
Introduction
Hierarchical Clustering is a method of cluster analysis that builds a hierarchy (tree
structure) of clusters.
It does not require specifying the number of clusters in advance and is useful for
understanding the data structure at different levels of granularity.
Definition:
Hierarchical clustering is a clustering technique that forms a tree-like structure (called a
dendrogram) to represent how data points are merged or divided step by step based on
their similarity.
Types of Hierarchical Clustering
(a) Agglomerative Hierarchical Clustering (AHC) – Bottom-Up Approach
Starts with each data point as a single cluster.
At each step, it merges the two closest clusters until all data points belong to one large
cluster.
It is the most common form of hierarchical clustering.
Steps:
Compute the distance matrix between all data points.
Treat each point as its own cluster.
Merge the two clusters with the smallest distance.
Update the distance matrix after each merge.
Repeat until only one cluster remains.
Example:
Suppose we have 4 data points: A, B, C, D
Initially: {A}, {B}, {C}, {D}
After merging step by step:
→ {A, B}, {C}, {D}
→ {A, B}, {C, D}
→ {A, B, C, D}
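The merge sequence above can be reproduced with a short sketch. The 1-D coordinates assigned to A, B, C and D are hypothetical (the notes give no values), single linkage is assumed as the distance between clusters, and distances are recomputed at each step rather than kept in an explicit distance matrix.

```python
from itertools import combinations

def single_link(c1, c2, points):
    """Single-linkage distance: the closest pair of points across two clusters."""
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the clustering after each step."""
    clusters = [frozenset([name]) for name in points]
    history = [sorted(sorted(c) for c in clusters)]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: single_link(pair[0], pair[1], points))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append(sorted(sorted(c) for c in clusters))
    return history

# Hypothetical coordinates chosen so that A,B and C,D are close pairs.
points = {"A": 1.0, "B": 1.5, "C": 5.0, "D": 6.0}
for step in agglomerate(points):
    print(step)
```

With these coordinates the printed steps follow the same order as the example: A and B merge first, then C and D, then the two pairs.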
(b) Divisive Hierarchical Clustering (DHC) – Top-Down Approach
Starts with one large cluster containing all data points.
At each step, it splits the cluster into smaller clusters based on dissimilarity.
Continues splitting until each object is in its own cluster.
Steps:
Start with one big cluster.
Find the cluster with the highest internal dissimilarity.
Divide it into smaller clusters.
Continue until each object is separate.
Note: Divisive methods are less commonly used due to higher computational cost.
Linkage (Distance) Criteria
To decide which clusters to merge or split, different linkage criteria are used:
Linkage Method | Description | Effect
Single Linkage | Distance between the closest points of two clusters | Tends to form long, "chain-like" clusters
Complete Linkage | Distance between the farthest points of two clusters | Produces compact, evenly sized clusters
Average Linkage | Average distance between all pairs of points in the two clusters | Balanced between single and complete
Centroid Linkage | Distance between the centroids of clusters | Works well for spherical clusters
Ward's Method | Minimizes the increase in variance within clusters | Produces well-separated clusters
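As a rough illustration of how the first four criteria differ, the sketch below computes each one for the same pair of clusters. The two clusters are made-up 2-D points, Euclidean distance is assumed, and Ward's method is left out because it compares variances rather than point distances.

```python
import math

def pairwise(c1, c2):
    """All Euclidean distances between a point of c1 and a point of c2."""
    return [math.dist(p, q) for p in c1 for q in c2]

def single(c1, c2):
    return min(pairwise(c1, c2))          # closest pair

def complete(c1, c2):
    return max(pairwise(c1, c2))          # farthest pair

def average(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)                # mean over all pairs

def centroid(c1, c2):
    def mean(cluster):
        return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return math.dist(mean(c1), mean(c2))  # distance between cluster means

# Hypothetical clusters used only for illustration.
c1 = [(0, 0), (1, 0)]
c2 = [(3, 0), (5, 0)]
for name, fn in [("single", single), ("complete", complete),
                 ("average", average), ("centroid", centroid)]:
    print(name, fn(c1, c2))
```

The same two clusters can look close under single linkage and far apart under complete linkage, which is why the choice of criterion changes the merge order.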
Density-Based Clustering (DBSCAN Algorithm)
Introduction
Density-Based Clustering is a data mining technique that groups together data points that are
closely packed and separates low-density regions as noise (outliers).
It is especially useful for discovering clusters of arbitrary shape and identifying outliers in a
dataset.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
is the most popular density-based clustering algorithm.
DBSCAN Algorithm Steps
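DBSCAN takes two parameters: Eps, the neighborhood radius, and MinPts, the minimum number of points a neighborhood must contain to count as dense.
Select an arbitrary unvisited point.
Find all points within Eps distance of this point (its neighborhood).
If neighborhood size ≥ MinPts:
Mark the point as a core point and form a new cluster.
Add all its neighbors to this cluster, and keep expanding the cluster through every neighbor that is itself a core point.
If neighborhood size < MinPts:
Mark it as noise (it may later be reassigned to a cluster as a border point if it lies within Eps of a core point).
Repeat until every point has been visited.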
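A minimal sketch of these steps in plain Python follows. It assumes Euclidean distance, counts a point as part of its own neighborhood, and visits points in list order rather than at random; the sample points and parameter values are invented for illustration.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)            # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                # may become a border point later
            continue
        cluster_id += 1                      # i is a core point: start a cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id       # noise reachable from a core point
            if labels[j] is not None:
                continue                     # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: expand
                queue.extend(j_neighbors)
    return labels

# Two tight groups of four points and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The isolated point keeps the label -1, which is how the sketch reports noise; the two groups receive cluster ids 0 and 1.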
Applications
Field | Application
Geographic Information Systems (GIS) | Detecting densely populated regions or zones.
Astronomy | Grouping stars or galaxies.
Fraud Detection | Identifying unusual transaction patterns.
Image Processing | Detecting object boundaries or texture patterns.
Market Analysis | Finding customer groups based on location or behavior.
Cluster Evaluation
Introduction
After performing clustering, it is important to evaluate the quality and validity of the
clusters formed.
This process is known as Cluster Evaluation or Cluster Validity Analysis.
Definition:
Cluster Evaluation is the process of measuring how well the clustering results represent
the natural grouping structure of the data.
The goal is to determine whether:
The clusters are meaningful, well-separated, and compact, and
The chosen clustering algorithm and parameters are appropriate.
Objectives of Cluster Evaluation
Assess the quality of clustering results.
Compare different clustering algorithms on the same dataset.
Select the optimal number of clusters (k).
Validate the stability and consistency of clusters.
Ensure meaningful interpretation of clusters for decision-making.
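One way to put a number on "compact and well-separated" is the silhouette coefficient. These notes do not name a specific measure, so it is used here only as a common example. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and scores (b - a) / max(a, b). The sketch assumes Euclidean distance and at least two clusters; the data and labelings are hypothetical.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)               # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = [0, 0, 1, 1]   # groups nearby points together
bad = [0, 1, 0, 1]    # pairs each point with a distant one
print(silhouette(points, good), silhouette(points, bad))
```

A higher score for one labeling than another on the same data is the kind of evidence used when comparing algorithms or choosing k.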
Applications of Cluster Evaluation
Selecting the best number of clusters (k) in K-Means.
Comparing performance of clustering algorithms (e.g., DBSCAN vs Hierarchical).
Detecting noise or irrelevant clusters.
Validating segmentation results in marketing, biology, and social science research.
Outlier Analysis
Introduction
In data mining, outlier analysis is the process of identifying data objects that are
significantly different from most of the data.
Such data points are called outliers or anomalies.
Definition:
An outlier is a data point that deviates markedly from other observations in the dataset
and may indicate variability, errors, or unusual events.
Outlier analysis is also known as:
Anomaly Detection
Deviation Detection
Exception Mining
Importance of Outlier Analysis
Outliers are important because they can:
Reveal fraudulent activities (e.g., credit card fraud detection).
Indicate network intrusions or attacks.
Highlight experimental errors or data entry mistakes.
Expose novel or rare events worth investigating.
Improve data quality by identifying and handling noisy data.
Types of Outliers
There are mainly three types of outliers in data mining:
(a) Global Outliers (Point Anomalies)
A data point that lies far from all other data points.
Example: In a dataset of people’s heights (in cm), a value like 300 cm is a global
outlier.
(b) Contextual Outliers (Conditional Anomalies)
A data point that is unusual in a specific context, but not overall.
Example: A temperature of 30°C is normal in summer but an outlier in winter.
(c) Collective Outliers
A group of related data points that collectively deviate from the norm.
Example: A sequence of unusual transactions in a short time could indicate fraud.
Causes of Outliers
Outliers may occur due to:
Data entry or measurement errors
Instrument malfunction
Sampling errors
Natural variation in population
Fraudulent or abnormal behavior
Outlier Detection Methods
(A) Statistical Methods
These methods assume that data follows a specific statistical distribution (like normal
distribution).
Points that lie far from the mean or expected range are treated as outliers.
Advantages:
Simple and interpretable.
Works well for low-dimensional data.
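A simple statistical check is the z-score rule: flag values more than a chosen number of standard deviations from the mean. The sketch below assumes roughly normal data and a threshold of 3, which is a common convention rather than something fixed by these notes; the height values are invented.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)          # population standard deviation
    if std == 0:
        return []                            # all values identical: nothing to flag
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical heights in cm, with one implausible entry.
heights = [158, 160, 162, 163, 165, 166, 168, 169, 170, 170,
           171, 172, 174, 175, 176, 178, 180, 182, 185, 300]
print(zscore_outliers(heights))
```

Note that the outlier itself inflates the mean and standard deviation, so in very small samples no value can reach a z-score of 3; median-based rules are sometimes preferred for that reason.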
(B) Distance-Based Methods
These methods consider data as points in a multidimensional space.
A point is an outlier if it is far away from most other points.
Advantages:
Does not assume any distribution.
Works well with continuous data.
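One common distance-based rule, shown here as an illustration, flags a point when the distance to its k-th nearest neighbor exceeds a threshold. The values of k and the threshold below are arbitrary choices for the sample data, not recommendations.

```python
import math

def knn_distance_outliers(points, k, threshold):
    """Flag points whose distance to their k-th nearest neighbor exceeds `threshold`."""
    outliers = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if dists[k - 1] > threshold:
            outliers.append(p)
    return outliers

# Four points in a unit square plus one far-away point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(knn_distance_outliers(points, k=2, threshold=2.0))
```

This brute-force version compares every pair of points, so its cost grows quadratically with the dataset size.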
(C) Density-Based Methods
These methods assume that normal data points occur in dense regions, whereas
outliers appear in sparse regions.
Advantages:
Detects local outliers effectively.
Handles clusters of varying density.
(D) Clustering-Based Methods
These methods use clustering algorithms (like K-Means or DBSCAN) to detect outliers.
Points that do not belong to any cluster or belong to very small clusters are considered
outliers.
Example: DBSCAN marks low-density points as noise (outliers).
Advantages:
Works well for complex data, though distance-based clustering becomes less reliable as dimensionality grows.
No need for statistical assumptions.
Mining Complex Data Types
Introduction
Traditional data mining techniques were primarily designed for structured data, such as data
stored in relational databases (tables with rows and columns).
However, in the modern era, a large portion of data is complex, semi-structured, or
unstructured, such as text, images, videos, spatial data, and web data.
Definition:
Mining Complex Data Types refers to the process of discovering meaningful patterns,
knowledge, and relationships from non-traditional or complex data sources that go beyond
simple numerical or categorical formats.
Need for Mining Complex Data
Complex data mining is important because:
Data is no longer homogeneous (different types and structures).
A huge amount of multimedia and web data is generated daily.
Traditional algorithms (like classification or clustering) cannot directly handle complex
structures.
Real-world applications (medical images, GPS data, social networks, etc.) require specialized
mining techniques.
Types of Complex Data
(A) Object-Relational and Heterogeneous Databases
Modern databases store complex objects such as images, audio, and documents, not just numbers or
text.
These databases may combine structured, semi-structured, and unstructured data.
Mining requires integrating information from multiple sources and handling heterogeneous
formats.
Example:
A hospital database may contain patient demographics (structured), medical images (unstructured), and
prescriptions (semi-structured).
(B) Spatial Data
Spatial data represents objects in space, such as maps, locations, and coordinates.
Examples include geographic information systems (GIS), satellite data, and environmental data.
Spatial Data Mining involves:
Discovering spatial patterns and relationships (e.g., nearby locations, clusters).
Examples:
Finding accident-prone areas in a city.
Detecting forest fire patterns from satellite images.
Techniques Used:
Spatial clustering
(C) Temporal Data
Temporal data contains time-related information such as stock prices, sensor readings, or weather
data.
The main goal is to discover trends, periodicity, or time-based patterns.
Example:
Predicting electricity usage patterns over time.
Techniques Used:
Time-series analysis
Sequential pattern mining
Trend detection
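A very basic form of trend detection is to smooth the series with a moving average and compare the start and end of the smoothed curve. The usage readings and the window size in this sketch are hypothetical.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def trend(series, window):
    """Label the overall direction by comparing the first and last smoothed values."""
    smoothed = moving_average(series, window)
    if smoothed[-1] > smoothed[0]:
        return "upward"
    if smoothed[-1] < smoothed[0]:
        return "downward"
    return "flat"

# Hypothetical electricity usage readings over eight periods.
usage = [10, 12, 11, 13, 15, 14, 16, 18]
print(moving_average(usage, window=3))
print(trend(usage, window=3))
```

Smoothing hides the small dips in the raw readings, so the underlying rise is easier to see.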
(D) Spatio-Temporal Data
Combines spatial and temporal dimensions.
Used when data changes both in space and time.
Example:
Tracking the spread of a disease across regions over months.
Techniques Used:
Moving object pattern mining
Spatio-temporal clustering
(E) Text Data
Text data mining (Text Mining) deals with unstructured text documents such as emails, articles, or reviews.
It extracts useful information, such as keywords, sentiment, and topics.
Applications:
Sentiment analysis on social media posts
Document classification and summarization
Techniques Used:
Natural Language Processing (NLP)
Term Frequency–Inverse Document Frequency (TF-IDF)
Topic Modeling (LDA)
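TF-IDF can be sketched in a few lines. Several variants exist; this one assumes term frequency as count divided by document length, inverse document frequency as the natural log of (number of documents / documents containing the term), and whitespace tokenization. The three documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return weights

docs = ["data mining finds patterns",
        "text mining finds topics",
        "clustering groups data"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that occur in only one document (such as "patterns") receive a higher weight than terms shared across documents (such as "mining"), and a term present in every document gets weight zero.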
(F) Multimedia Data
Includes images, audio, video, and graphics.
Mining multimedia data aims to identify patterns, similarities, and semantic information.
Applications:
Face recognition, image retrieval, video surveillance.
Detecting scenes or emotions in videos.
Techniques Used:
Feature extraction (color, shape, texture)
Content-Based Image Retrieval (CBIR)
Deep Learning (CNNs for images, RNNs for audio/video)
Challenges in Mining Complex Data
High dimensionality – Complex data often involves multiple attributes (space, time,
etc.).
Heterogeneous formats – Combining text, image, and numeric data is difficult.
Scalability – Large datasets require high computational power.
Noise and missing values – Common in multimedia and sensor data.
Semantic understanding – Extracting “meaning” from unstructured data is hard.
Applications
Healthcare: Mining medical images and patient histories.
Finance: Detecting fraudulent transactions using temporal and relational data.
Environment: Analyzing spatial and temporal climate data.
Social Media: Analyzing posts, images, and network relationships.
E-commerce: Recommendation systems using web usage and purchase patterns.
Avoiding False Discoveries
Introduction
In data mining, the goal is to find useful and meaningful patterns from large datasets.
However, sometimes the mining process may produce patterns that appear
significant but are actually random coincidences or occur by chance — these are
known as false discoveries.
Definition:
False discoveries refer to patterns, correlations, or associations that appear to be
statistically significant but do not represent real, meaningful relationships in the
underlying data.
Avoiding false discoveries is essential to ensure that the knowledge extracted is
reliable, valid, and actionable.
Why False Discoveries Occur
False discoveries usually occur due to:
Large datasets: When data is huge, random correlations may seem statistically
significant.
High dimensionality: When many attributes are analyzed simultaneously, the
probability of finding spurious patterns increases.
Overfitting: When a model learns noise instead of the actual data pattern.
Lack of validation: Patterns are not verified with independent data.
Multiple testing: When numerous hypotheses are tested, some may appear “true” just
by chance.
Techniques to Avoid False Discoveries
(A) Cross-Validation
Divides the dataset into training and testing subsets, typically rotating through k folds so that every record is used for testing exactly once.
The discovered patterns or models are validated on the test set.
If the pattern performs well on unseen data, it is likely genuine.
Example:
In classification, if accuracy remains consistent on test data → pattern is not random.
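Cross-validation is often run as k-fold validation: the data is cut into k folds and each fold serves once as the test set. The sketch below only generates the index splits, without shuffling; training and scoring a model on each split is left out.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_splits(n_samples=6, k=3):
    print("train:", train, "test:", test)
```

A pattern whose accuracy stays similar across all k test folds is less likely to be an artifact of one lucky split.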
(B) Statistical Significance Testing
Used to check whether a discovered pattern is statistically significant or just due to
random variation.
Common tests include:
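p-value test: The p-value is the probability of seeing a pattern at least this strong if no real effect existed. By convention, a pattern is considered significant if p < 0.05.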
Chi-square test: Used to check independence between categorical variables.
Example:
In association rule mining, verify that the rule’s occurrence is statistically different
from what would happen by random chance.
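For a rule such as X → Y, the check can be sketched as a chi-square test of independence on a 2x2 table of counts. The counts below are hypothetical, no continuity correction is applied, and the p-value uses the identity for one degree of freedom, p = erfc(sqrt(stat / 2)).

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 count table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: transaction contains X / does not; columns: contains Y / does not.
table = [[40, 10],
         [20, 30]]
stat, p = chi_square_2x2(table)
print(stat, p, "significant" if p < 0.05 else "not significant")
```

When the observed counts match the expected counts exactly, the statistic is 0 and the p-value is 1, meaning the data gives no evidence of an association.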
(C) Multiple Hypothesis Testing Correction
When testing many hypotheses, some are expected to appear significant by chance; at a 0.05 threshold, roughly 5 of every 100 tests of nonexistent effects will pass.
To avoid this, corrections such as:
Bonferroni correction
False Discovery Rate (FDR) control
are applied to adjust the significance threshold.
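Both corrections are short to state in code. The Bonferroni version compares each p-value with alpha divided by the number of tests. For FDR control the sketch uses the Benjamini-Hochberg procedure, which is one standard choice and is not named in these notes. The p-values are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for i in order[:max_rank]:
        rejected[i] = True
    return rejected

p_values = [0.001, 0.012, 0.025, 0.035, 0.2]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these values Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps four, illustrating that FDR control is the less conservative of the two.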
(D) Using Holdout or Validation Datasets
Divide data into three parts: training, validation, and testing.
Use validation data to tune parameters and testing data to check final performance.
Reduces overfitting and lowers the risk of false positives.
(E) Data Cleaning and Preprocessing
Remove noise, duplicates, and incomplete data before mining.
Reduces the risk of discovering misleading patterns caused by erroneous data.
(F) Domain Knowledge Verification
Validate discovered patterns with expert knowledge or domain experts.
Helps confirm whether the results are meaningful and make practical sense.
Example:
If a mining algorithm finds that “rain increases ice cream sales,” domain experts can
easily identify this as a false or spurious relation.
(G) Regularization and Simplification
In predictive modeling, simpler models are less likely to overfit.
Techniques like L1/L2 regularization, pruning, or early stopping prevent models
from learning noise in data.
(H) Reproducibility and Re-sampling
Repeat experiments using different samples of data.
If the same patterns appear consistently, they are likely genuine.
Techniques include:
Bootstrap sampling
Repeated random subsampling
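Bootstrap sampling can be sketched as follows: draw many resamples with replacement, recompute the statistic on each, and look at how much it varies. Here the "pattern" is simply the sample mean; the data, number of resamples, and seed are arbitrary.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of `n_resamples` resamples drawn with replacement from `data`."""
    rng = random.Random(seed)                # fixed seed so runs are repeatable
    return [statistics.mean(rng.choices(data, k=len(data)))
            for _ in range(n_resamples)]

def percentile_interval(samples, lower=0.025, upper=0.975):
    """Approximate percentile interval of the resampled statistic."""
    ordered = sorted(samples)
    lo = ordered[int(lower * len(ordered))]
    hi = ordered[min(int(upper * len(ordered)), len(ordered) - 1)]
    return lo, hi

data = [2, 4, 4, 5, 7, 9]                    # hypothetical measurements
means = bootstrap_means(data)
print(percentile_interval(means))
```

A narrow interval suggests the statistic is stable across resamples; a wide one suggests the observed value depends heavily on which records happened to be sampled.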