Understanding Cluster Analysis Methods
4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one or until a termination
condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. This rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial number of different choices.
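The bottom-up procedure can be illustrated with a minimal sketch. It assumes 1-D points, single-link distance (the smallest gap between two groups), and a target number of clusters as the termination condition; the function names are illustrative and not taken from the text.

```python
from itertools import combinations

def single_link_distance(a, b):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(p - q) for p in a for q in b)

def agglomerative(points, num_clusters):
    """Start with one cluster per object and merge the closest pair
    until only num_clusters groups remain (the termination condition)."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters that are closest to one another.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge them. The merge is final: no later step undoes it.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

result = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], num_clusters=3)
print(result)
```

Because each merge is final, the loop only ever considers pairs of current clusters, which is where the reduced computation cost comes from.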
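The square-error criterion is defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$
where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
m_i is the mean of cluster C_i.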
The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
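A small numeric sketch can make the outlier effect concrete. It assumes 1-D data and fixed cluster memberships (no reassignment step), and the values are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def squared_error(clusters):
    """Sum over clusters of the squared distance of each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        m = mean(cluster)
        total += sum((p - m) ** 2 for p in cluster)
    return total

clean = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
# Same data with one extreme value added to the second cluster.
with_outlier = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 100.0]]

e_clean = squared_error(clean)
e_outlier = squared_error(with_outlier)
mean_shift = mean(with_outlier[1]) - mean(clean[1])

print(e_clean, e_outlier, mean_shift)
```

A single extreme value pulls the cluster mean far from the other members, and squaring the distances inflates the error much more than an absolute-distance measure would.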
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
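The partitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as
$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$$
where E is the sum of the absolute error for all objects in the data set, p is the point
representing a given object in cluster C_j, and o_j is the representative object of C_j.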
Case 1:
p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object
and p is closest to one of the other representative objects, o_i, i ≠ j, then p is reassigned to o_i.
Case 2:
p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object
and p is closest to o_random, then p is reassigned to o_random.
Case 3:
p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative
object and p is still closest to o_i, then the assignment does not change.
Case 4:
p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative
object and p is closest to o_random, then p is reassigned to o_random.
Four cases of the cost function for k-medoids clustering
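The four cases can be folded into one cost computation: reassign every object to its nearest representative after the swap and compare the absolute error before and after. The sketch below assumes 1-D points and absolute difference as the dissimilarity; the names `absolute_error` and `swap_cost` are illustrative.

```python
def absolute_error(points, medoids):
    """Sum of distances from each point to its nearest representative object."""
    return sum(min(abs(p - o) for o in medoids) for p in points)

def swap_cost(points, medoids, o_j, o_random):
    """Change in absolute error if representative o_j is replaced by o_random.
    A negative value means the swap improves the clustering."""
    candidate = [o_random if o == o_j else o for o in medoids]
    return absolute_error(points, candidate) - absolute_error(points, medoids)

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids = [1.0, 10.0]

cost = swap_cost(points, medoids, o_j=1.0, o_random=2.0)
print(cost)
```

Taking the minimum over all representatives covers Cases 1 to 4 implicitly, because each point ends up with whichever representative is nearest once the swap is made.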
4.4.2 The k-Medoids Algorithm:
A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to
perform adjustment once a merge or split decision has been executed. That is, if a particular
merge or split decision later turns out to have been a poor choice, the method cannot
backtrack and correct it.
The within-cluster sum of squares (WCSS) is computed as
$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \lVert x_j - \mu_i \rVert^2$$
where k is the number of clusters, nᵢ is the number of data points in cluster i, xⱼ is a data point in cluster i, and μᵢ is the
centroid of cluster i.
The idea is to look for the “elbow point” on the graph, where the rate of decrease in WCSS starts to slow down.
The number of clusters at the elbow point is often considered a reasonable choice.
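The elbow idea can be sketched by computing WCSS for several values of k and looking at where the drop levels off. This is a minimal illustration, assuming 1-D data, a plain Lloyd-style k-means, and a deterministic evenly spaced initialisation chosen only to keep the example reproducible.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain Lloyd's algorithm on 1-D data; returns (centroids, clusters)."""
    data = sorted(points)
    # Deterministic start: k evenly spaced points from the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def wcss(centroids, clusters):
    """Within-cluster sum of squares for the given assignment."""
    return sum((x - mu) ** 2 for mu, c in zip(centroids, clusters) for x in c)

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
scores = {k: wcss(*kmeans_1d(data, k)) for k in range(1, 5)}
for k, score in scores.items():
    print(k, round(score, 2))
```

On this toy data the WCSS falls steeply up to k = 3 and barely changes afterwards, so k = 3 is the elbow.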
2. The Silhouette Score:
The silhouette score measures how similar an object is to its own cluster compared to other clusters.
Mathematically, the silhouette score for a data point x is computed as:
$$s(x) = \frac{b(x) - a(x)}{\max\{a(x),\, b(x)\}}$$
where a(x) is the average distance from x to the other data points in the same cluster, and b(x) is the minimum
average distance from x to data points in a different cluster.
The score ranges from -1 to 1. A higher silhouette score indicates that the object is well-matched to its cluster.
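As a rough illustration, the silhouette of a single point can be computed directly from the definitions of a(x) and b(x). The sketch assumes 1-D points with absolute difference as the distance, and returns 0 for a singleton cluster, which is a common convention rather than something stated in the text.

```python
def mean_distance(x, points):
    return sum(abs(x - p) for p in points) / len(points)

def silhouette(index, cluster_id, clusters):
    """Silhouette of the point at clusters[cluster_id][index]."""
    own = clusters[cluster_id]
    x = own[index]
    rest = own[:index] + own[index + 1:]
    if not rest:
        return 0.0  # assumed convention for a cluster with a single member
    a = mean_distance(x, rest)  # average distance within the same cluster
    b = min(mean_distance(x, c)  # nearest other cluster, on average
            for i, c in enumerate(clusters) if i != cluster_id)
    return (b - a) / max(a, b)

clusters = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
s = silhouette(0, 0, clusters)
print(s)
```

For the point 1.0, a = 1.5 and b = 10, giving a score of 0.85, which indicates a well-separated assignment.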
3. Gap Statistics:
Gap statistics compare the performance of your clustering algorithm on the actual data to its performance on
random data (generated under the assumption of no meaningful clusters).
Mathematically, gap statistics involve comparing the observed within-cluster sum of squares to the expected
within-cluster sum of squares under a null hypothesis.
4. Dendrogram Analysis:
A dendrogram is a tree-like diagram that shows the hierarchical relationships between clusters. It is created by
a clustering algorithm, which starts by treating each data point as a separate cluster. The algorithm then
repeatedly merges the two closest clusters until there is only one cluster left. The dendrogram shows the order
in which the clusters were merged.
The height at which two branches of a dendrogram join represents the distance between the clusters being merged.
The lower the join, the more similar the two clusters are. The point where the merge heights start to grow
sharply is the point where the clusters are becoming more distinct.
The number of clusters at this point can be a reasonable estimate of the number of clusters in the data.
However, it is important to note that this is just an estimate, and the actual number of clusters may vary
depending on the clustering algorithm and the data set.
5. Prediction Strength and Co-Membership Matrix:
Prediction Strength (PS) assesses the stability of the clusters by repeatedly subsampling the data and clustering
subsets of it.
It measures how often data points that belong to the same cluster in the original data are still clustered
together in the subsamples.
A higher Prediction Strength indicates more stable and robust clusters.
What is Classification?
Classification is a data mining technique that uses a set of training data to determine the class or category of a new
observation. This method of supervised learning uses statistical and machine learning techniques to create a model
that can categorise fresh data in accordance with the patterns seen in the training data.
A dataset is split into a training set and a test set for classification. The classification model is constructed using
the training set, and its effectiveness is assessed using the test set.
The classification algorithm learns from the training data and applies what it has learned to predict the class of
incoming, unseen data.
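The train/test workflow can be sketched with a tiny k-nearest-neighbours classifier. This is only an illustration under simple assumptions: 1-D features, made-up labels, and every fourth example held out as the test set.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the majority label among the k training points nearest to query.
    train is a list of (feature, label) pairs with 1-D features."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

labelled = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "low"),
            (8.0, "high"), (8.5, "high"), (9.0, "high"), (9.5, "high")]

# Hold out every fourth example for testing; train on the rest.
test_set = labelled[3::4]
train_set = [pair for i, pair in enumerate(labelled) if i % 4 != 3]

correct = sum(knn_predict(train_set, x) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

The model never sees the held-out examples during training, so the accuracy on them is an estimate of how it behaves on new data.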
Many applications, including image recognition, spam filtering, fraud detection, and medical diagnosis, heavily
rely on classification.
Decision trees, k-nearest neighbours, support vector machines, and neural networks are some common
classification algorithms.
Classification can be either "binary classification" or "multinomial classification".
When there are exactly two target classes, then it is known as binary classification.
When there are more than two target classes, as in the case of pattern recognition problems, then it is known
as multinomial (multiclass) classification.
Advantages of Applying Classification in Data Mining
Following are the advantages of applying Classification in Data Mining:
Predictive power: In order to forecast the class or category of new data, classification can help find patterns in
data that can be utilised for prediction and decision-making.
Interpretable results: As many classification algorithms provide models that are simple to understand, it is
simpler for people to comprehend the logic behind a given classification.
Scalability: Classification is a scalable data mining technique since it can be used on big datasets.
Versatility: Classification is flexible and broadly applicable since it can be applied to many different forms of
data, including numerical and categorical data.
Disadvantages of Applying Classification in Data Mining
Following are the disadvantages of applying Classification in Data Mining:
Overfitting: When a classification model fits training data too closely, it is said to be overfit, which leads to
subpar performance on new, unseen data.
Bias: Classification models may favour some classes or traits over others, which could lead to incorrect
predictions.
Data quality: Inaccurate or inadequate data can lead to incorrect predictions, which reduces the accuracy of
the classification model.
Complexity: Certain classification algorithms can be difficult to develop and interpret, and they may
need a lot of computing power.
Sensitivity to input data: Classification models are sometimes susceptible to changes in the input data, which
can have a major impact on the predicted classes.
Segmentation in big data refers to dividing large datasets into meaningful groups or
segments based on specific characteristics. This helps in targeted analytics, improved
decision-making, and efficient data processing. Segmentation is widely used in areas like
marketing, healthcare, and fraud detection.
1. Types of Segmentation in Big Data
Big data can be segmented based on different criteria:
A. Demographic Segmentation
Used in customer analytics, this involves dividing data based on:
- **Age Groups**: Analysing behaviour of different age demographics.
- **Geographical Location**: Sorting data based on country, city, or region.
- **Income Levels**: Understanding consumer spending patterns.
B. Behavioural Segmentation
Divides data based on how users interact with systems:
Purchase History: Categorizing customers based on previous transactions.
Browsing Patterns: Identifying frequent website visits and interests.
Usage Frequency: Sorting users into heavy, moderate, and light usage groups.
C. Temporal Segmentation
Involves dividing data based on time-dependent behaviours:
Seasonal Trends: Understanding customer behaviour across different seasons.
Event-based Analysis: Studying reactions before and after a particular event.
D. Spatial Segmentation
Uses location-based analytics:
Heatmaps for Customer Density: Helps retailers analyse foot traffic.
Urban vs. Rural Behaviour: Different purchase patterns in cities versus villages.
2. Methods of Segmentation in Big Data
There are different techniques used to segment big data:
A. Clustering-Based Segmentation
K-Means Clustering: Groups similar data points into clusters.
Hierarchical Clustering: Builds tree-like segments based on similarities.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Identifies clusters
based on density.
B. Machine Learning-Based Segmentation
Supervised Learning: Uses labelled data to categorize entities.
Unsupervised Learning: Discovers natural groupings in unlabelled data.
C. Rule-Based Segmentation
- Uses predefined business rules to classify data, such as segmenting customers based on
purchase frequency.
D. Graph-Based Segmentation
Community Detection Algorithms: Finds influential groups in networks (e.g., social media
analysis).
Link Analysis: Identifies relationships among nodes in large datasets.
3. Applications of Segmentation in Big Data
Segmentation helps businesses make data-driven decisions:
Retail: Personalized marketing based on customer preferences.
Healthcare: Patient risk profiling for targeted treatments.
Fraud Detection: Identifying suspicious patterns in financial transactions.
Social Media Analytics: Tracking sentiment trends within user groups.
What is linear regression?
Linear regression is a data analysis technique that predicts the value of unknown data by
using another related and known data value. It mathematically models the unknown or
dependent variable and the known or independent variable as a linear equation. For instance,
suppose that you have data about your expenses and income for last year. Linear regression
techniques analyse this data and determine that your expenses are half your income. They
then calculate an unknown future expense by halving a future known income.
How does linear regression work?
At its core, a simple linear regression technique attempts to plot a line graph between two
data variables, x and y. As the independent variable, x is plotted along the horizontal axis.
Independent variables are also called explanatory variables or predictor variables. The
dependent variable, y, is plotted on the vertical axis. You can also refer to y values as response
variables or predicted variables.
Steps in linear regression
For this overview, consider the simplest form of the line graph equation between y and x:
y=c*x+m, where c and m are constant for all possible values of x and y. So, for example,
suppose that the input dataset for (x,y) was (1,5), (2,8), and (3,11). To identify the linear
regression method, you would take the following steps:
1. Plot a straight line, and measure the correlation between 1 and 5.
2. Keep changing the direction of the straight line for new values (2,8) and (3,11) until all
values fit.
3. Identify the linear regression equation as y=3*x+2.
4. Extrapolate or predict that y is 14 when x is 4.
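In practice the line is usually found with a closed-form least-squares fit rather than by adjusting it by hand. The sketch below applies the standard ordinary least squares formulas to the three points from the example; the function name `fit_line` and the choice of x = 4 as the input to predict are illustrative.

```python
def fit_line(points):
    """Ordinary least squares fit of y = c*x + m; returns (c, m)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    c = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    m = mean_y - c * mean_x
    return c, m

c, m = fit_line([(1, 5), (2, 8), (3, 11)])
prediction = c * 4 + m
print(c, m, prediction)
```

The fit recovers c = 3 and m = 2, so the predicted value at x = 4 is 14.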
What are the types of linear regression?
Some types of regression analysis are more suited to handle complex datasets than others.
The following are some examples.
Simple linear regression
Simple linear regression is defined by the linear function:
Y = β0*X + β1 + ε
β0 and β1 are two unknown constants representing the regression slope and intercept,
respectively, whereas ε (epsilon) is the error term.
You can use simple linear regression to model the relationship between two variables, such
as these:
Rainfall and crop yield
Age and height in children
Temperature and expansion of the metal mercury in a thermometer
Multiple linear regression
In multiple linear regression analysis, the dataset contains one dependent variable and
multiple independent variables. The linear regression line function changes to include more
factors as follows:
Y = β0*X0 + β1*X1 + β2*X2 + … + βn*Xn + ε
As the number of predictor variables increases, the number of β coefficients increases correspondingly.
Multiple linear regression models multiple variables and their impact on an outcome:
Rainfall, temperature, and fertilizer use on crop yield
Diet and exercise on heart disease
Wage growth and inflation on home loan rates
Logistic regression
Data scientists use logistic regression to measure the probability of an event occurring. The
prediction is a value between 0 and 1, where 0 indicates an event that is unlikely to happen,
and 1 indicates an event that is very likely to happen. Logistic regression models the
log-odds of the event as a linear function of the inputs and uses the logistic (sigmoid)
function to map that value to a probability.
These are some examples:
The probability of a win or loss in a sporting match
The probability of passing or failing a test
The probability of an image being a fruit or an animal
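A short sketch shows how a logistic model turns a linear score into a probability between 0 and 1. The slope and intercept below are made-up values chosen for illustration, not fitted coefficients.

```python
import math

def predict_probability(x, slope, intercept):
    """Map a linear score (the log-odds) to a probability with the sigmoid."""
    z = slope * x + intercept
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: probability of passing a test given hours studied.
p_low = predict_probability(0.0, slope=1.0, intercept=-5.0)
p_mid = predict_probability(5.0, slope=1.0, intercept=-5.0)
p_high = predict_probability(10.0, slope=1.0, intercept=-5.0)
print(p_low, p_mid, p_high)
```

A score of zero maps to a probability of exactly 0.5, and large negative or positive scores approach 0 and 1 without ever reaching them.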
What is ML Search?
It is the use of machine learning algorithms to:
Improve search relevance
Understand user intent
Personalize search results
Predict queries or autocomplete
Cluster and classify search results
Challenges
Scalability – Must handle billions of documents and real-time search
Training Data – Needs large, labeled datasets
Latency – Real-time or near-real-time performance required
Bias and Fairness – Ensure fair and unbiased results
Privacy – Protect user data during search personalization
What is Indexing in DBMS?
Indexing is used to quickly retrieve particular data from the database.
Formally we can define Indexing as a technique that uses data structures to
optimize the searching time of a database query in DBMS. Indexing reduces
the number of disk accesses required to retrieve particular data by internally
creating an index table.
Indexing is achieved by creating Index-table or Index.
Index usually consists of two columns which are a key-value pair. The two
columns of the index table(i.e., the key-value pair) contain copies of selected
columns of the tabular data of the database.
Types of Indexes in DBMS
The classification of indexing in databases is dependent on the indexing attributes. The two
primary types of indexing methods are known as
1. Primary Indexing.
2. Secondary Indexing.
Types of indexing
Primary Indexing
A Primary Index is a fixed-length, ordered file that contains two fields: the first field is
the primary key, and the second field points to the corresponding data block. In the Primary
Index, each entry in the index table has a one-to-one relationship with the data block.
Primary indexing is further divided into two types:
Dense Indexing
1. It creates an index record for every record of the table, with one field equal to the
index/search key value and the other field holding a pointer to the actual data record on
the disk.
2. It takes a lot of extra space, as it makes a single entry for every record of the table.
Sparse Index
1. In this indexing technique, rather than having an entry for each record of the actual
data and mapping it to an index field, we keep only a few entries in the index table
and use range-based indexing, in which the index column is sorted on the basis of range.
2. It avoids the dense-indexing problem of needing more space.
3. It requires less maintenance overhead.
4. The only drawback of this indexing technique is that it takes a little more time to locate the
actual record. How to balance this trade-off between space and time is up to the solution
architect of the team.
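The space/time trade-off between the two variants can be sketched in a few lines. The example assumes a small in-memory "data file" of sorted blocks; `dense_lookup` and `sparse_lookup` are illustrative names, and Python's `bisect` stands in for the range search over the sparse index.

```python
from bisect import bisect_right

# A sorted "data file": each block holds (key, value) records ordered by key.
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Dense index: one entry per record -> (block number, offset in block).
dense_index = {key: (b, i)
               for b, block in enumerate(blocks)
               for i, (key, _) in enumerate(block)}

# Sparse index: one entry per block, holding that block's first key.
sparse_keys = [block[0][0] for block in blocks]

def dense_lookup(key):
    pos = dense_index.get(key)
    if pos is None:
        return None
    b, i = pos
    return blocks[b][i][1]

def sparse_lookup(key):
    # Find the last block whose first key is <= the search key.
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    # Short sequential scan inside the chosen block.
    for k, value in blocks[b]:
        if k == key:
            return value
    return None

print(dense_lookup(50), sparse_lookup(50))
print(len(dense_index), len(sparse_keys))
```

The dense index stores nine entries and jumps straight to the record, while the sparse index stores three entries and pays for it with a short scan inside one block.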
Secondary Indexing
Internal/Under the hood
Bucket of pointers
A bucket of pointers is a bucket/list of pointers pointing to the actual tuples/records of the table.
So now we have,
1. First Level: Dense Index where each value is stored once
2. Bucket list where we have multiple occurrences
Pointers to actual values
Should be sequential, i.e. ordered by the key.
3. Bucket pointers help us save a lot of space.
Clustering Index
In a clustered index, the records themselves are stored in the index, not pointers to them.
Indexes are sometimes created on non-primary-key columns, which might not be unique for each record.
In such a situation, you can group two or more columns to get unique values and create
an index, which is called a Clustered Index.
This also helps you to identify the record faster.
Example:
Let's assume that in a school, all the students belonging to the same class are clustered together
under one index.
The index pointer will be the class_no, and it points to the cluster as a whole.
Multi-Level Indexing
Before going in depth, let's understand why multi-level indexing is needed and the
problems associated with sequential searching.
Sequential Searching
Sequential searching is a method of searching data in a linear manner, by going through each
record one by one until the required data is found. This can be a time-consuming process for
large datasets, as each record must be searched in sequence until the desired result is found.
Problem with Sequential Searching
Let's try to understand the sequential searching problem with the help of an example.
Consider a bank with 100 million customers. They all have their own account number, which
is an 8-digit number.
Sometimes we query the table containing the customer records, for example when a
customer wants to know about his or her account.
Imagine we didn’t have an index, just one big sequential file. To find a particular account, you
would go to the beginning of the file and search one record after another until you either
find the record or come to the end of the file. This is a simple system, but with so many records
it would take a long time. You could end up checking 100 million records before you found
the one you wanted!
Multi-level Indexing: Overview
Multi-level indexing is a technique used in databases to speed up the search process by
dividing the index into multiple levels. This allows for faster access to the data, as the search
process only needs to traverse a smaller portion of the index.
Multi-level indexing in a database is created when a primary index does not fit in memory.
We create a sparse index on the sequential file containing the records.
This sparse index file is divided into levels/portions for better searching.
Example:
Multi-level (3-level) indexing on account number
Let's say that in the above example of 100 million records, we are trying to find the account
number 45843521.
We can use the power of multi-level indexing to search for this record in much less
time.
We will create an index on the account number and divide the index by ranges of digits.
First level index -> Range from 0–9.
Second level index -> Range from 00–99.
Third level index -> Range from 00–99, and so on.
In each level of the index, we do a sequential search for the account number. As soon as we find a
hit, we move on to the next level, until we reach the exact match of the
account number we are looking for.
“This combination of multi-level indexing and sequential searching will help in optimization
of the query and return search results in less time”
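The three-level lookup can be sketched with nested dictionaries keyed by digit ranges: the first digit, then the next two digits, then the next two. This is an illustrative simplification: dictionary lookups stand in for the per-level sequential search described above, and the record contents are made up.

```python
def split_key(account_number):
    """Split an 8-digit account number into the three index-level keys."""
    digits = f"{account_number:08d}"
    return digits[0], digits[1:3], digits[3:5]

def build_index(records):
    """Build a 3-level index; the leaves hold small lists of records."""
    index = {}
    for account, data in records:
        l1, l2, l3 = split_key(account)
        leaf = index.setdefault(l1, {}).setdefault(l2, {}).setdefault(l3, [])
        leaf.append((account, data))
    return index

def lookup(index, account):
    l1, l2, l3 = split_key(account)
    # Descend one level at a time; each step narrows the search range.
    leaf = index.get(l1, {}).get(l2, {}).get(l3, [])
    # Only the small leaf list is scanned sequentially.
    for acc, data in leaf:
        if acc == account:
            return data
    return None

records = [(45843521, "Alice"), (45843999, "Bob"), (12345678, "Carol")]
index = build_index(records)
print(split_key(45843521), lookup(index, 45843521))
```

With 100 million accounts, three levels of this kind leave at most 1,000 candidates in a leaf, compared with scanning the whole file.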
Variance Reduction: Splits based on minimizing variance within each group.
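Variance reduction can be sketched for a single numeric feature: try each candidate threshold and keep the one that lowers the weighted variance of the target the most. The helper names and the toy data are illustrative assumptions.

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted

def best_split(xs, ys):
    """Try midpoints between consecutive sorted x values; return (threshold, gain)."""
    pairs = sorted(zip(xs, ys))
    targets = [y for _, y in pairs]
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        gain = variance_reduction(targets, left, right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

threshold, gain = best_split([1, 2, 3, 10, 11, 12],
                             [5.0, 6.0, 5.5, 20.0, 21.0, 19.5])
print(threshold, round(gain, 3))
```

The chosen threshold separates the low-valued targets from the high-valued ones, which is the split that leaves each group most homogeneous.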
3. Advantages of Decision Trees
Easy to Interpret: Mimics human decision-making.
No Feature Scaling Needed: Works well with raw data.
Handles Categorical & Numerical Data: Can process mixed-type data efficiently.
Robust to Missing Values: Can handle missing attributes without requiring imputation.
4. Drawbacks & Solutions
Overfitting: Trees may grow too complex, leading to poor generalization. Solution: Use
pruning techniques (such as post-pruning or pre-pruning).
Bias in Imbalanced Data: If one class dominates, it affects accuracy. Solution: Use
balanced datasets or cost-sensitive learning.
5. Practical Applications
Medical Diagnosis: Predicting diseases based on patient symptoms.
Fraud Detection: Identifying fraudulent transactions in banking.
Sentiment Analysis: Classifying positive vs. negative customer reviews.
Spam Detection: Filtering spam emails based on keywords and metadata.
Recall: Measures how many of the actual positive cases were correctly
identified.
Confusion Matrix: Shows how predictions are distributed across true and
predicted labels.
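Both measures can be computed from paired lists of actual and predicted labels. The sketch below stores the confusion matrix as counts keyed by (actual, predicted) pairs; the labels and predictions are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def recall(actual, predicted, positive):
    """True positives divided by all actual positive cases."""
    cm = confusion_matrix(actual, predicted)
    tp = cm[(positive, positive)]
    fn = sum(count for (a, p), count in cm.items()
             if a == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0

actual = ["spam", "spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "spam", "ham", "spam", "spam"]

cm = confusion_matrix(actual, predicted)
spam_recall = recall(actual, predicted, positive="spam")
print(dict(cm), spam_recall)
```

Three of the four actual spam messages are caught, so the recall for the spam class is 0.75.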
B. Regression Decision Trees
For models predicting numerical values, we use:
Mean Squared Error (MSE): Measures the average squared difference
between actual and predicted values.
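As a quick illustration, MSE is the mean of the squared residuals. The values below are made up.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Residuals are 1, 0, and -2, so the squared errors are 1, 0, and 4.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(mse)
```

Because the errors are squared, the single residual of 2 contributes four times as much as the residual of 1.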