0% found this document useful (0 votes)
4 views22 pages

Module 2

The document discusses various data mining techniques, including classification, clustering, regression, association rules, outlier analysis, sequential patterns, and prediction, highlighting their applications and methodologies. It also covers frequent pattern mining, emphasizing the importance of algorithms like Apriori and FP-Growth for discovering associations in large datasets. Additionally, it explains key concepts such as support, confidence, and lift in relation to association rule learning.

Uploaded by

angelbanga2666
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views22 pages

Module 2

The document discusses various data mining techniques, including classification, clustering, regression, association rules, outlier analysis, sequential patterns, and prediction, highlighting their applications and methodologies. It also covers frequent pattern mining, emphasizing the importance of algorithms like Apriori and FP-Growth for discovering associations in large datasets. Additionally, it explains key concepts such as support, confidence, and lift in relation to association rule learning.

Uploaded by

angelbanga2666
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Data Mining Notes

UNIT – 2 ______________
Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously unknown, valid
patterns and relationships in huge data sets. These tools can incorporate statistical models, machine
learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus,
data mining incorporates analysis and prediction.

1.​ Classification: This technique is used to obtain important and relevant information about
data and metadata. This data mining technique helps to classify data in different classes. Data
mining techniques can be classified by different criteria, as follows:
i​ Classification of Data mining frameworks as per the type of data sources mined: This
classification is as per the type of data handled. For example, multimedia, spatial data, text data,
time-series data, World Wide Web, and so on..
ii​ Classification of data mining frameworks as per the database involved: This classification based
on the data model involved. For example. Object-oriented database, transactional database,
relational database, and so on..
iii Classification of data mining frameworks as per the kind of knowledge discovered: This
classification depends on the types of knowledge discovered or data mining functionalities. For
example, discrimination, classification, clustering, characterization, etc. some frameworks tend to be
extensive frameworks offering a few data mining functionalities together..
iv Classification of data mining frameworks according to data mining techniques used: This
classification is as per the data analysis approach utilized, such as neural networks, machine learning,
genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented, etc.
The classification can also take into account, the level of user interaction involved in the data mining
procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.

2.​ Clustering: Clustering is a division of information into groups of connected objects.


Describing the data by a few clusters mainly loses certain confine details, but accomplishes
improvement. It models data by its clusters. Data modeling puts clustering from a historical point
of view rooted in statistics, mathematics, and numerical analysis. From a machine learning point of
view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the
subsequent framework represents a data concept. From a practical point of view, clustering plays
an extraordinary job in data mining applications. For example, scientific data exploration, text
mining, information retrieval, spatial database applications, CRM, Web analysis, computational
biology, medical diagnostics, and much more. In other words, we can say that Clustering analysis
is a data mining technique to identify similar data. This technique helps to recognize the
differences and similarities between the data. Clustering is very similar to the classification, but it
involves grouping chunks of data together based on their similarities.

3.​ Regression: Regression analysis is the data mining process is used to identify and analyze the
relationship between variables because of the presence of the other factor. It is used to define the
probability of the specific variable. Regression, primarily a form of planning and modeling. For
example, we might use it to project certain costs, depending on other factors such as availability,
consumer demand, and competition. Primarily it gives the exact relationship between two or more
variables in the given data set.

4.​ Association Rules: This data mining technique helps to discover a link between two or more
items. It finds a hidden pattern in the data set. Association rules are if-then statements that
support to show the probability of interactions between data items within large data sets in
different types of databases. Association rule mining has several applications and is commonly
used to help sales correlations in data or medical data sets. The way the algorithm works is that
you have various data, For example, a list of grocery items that you have been buying for the last
six months. It calculates a percentage of items being purchased together. These are three major
measurements technique:
o​ Lift:​ This measurement technique measures the accuracy of the confidence over how often item B
is purchased.
(Confidence) / (item B)/ (Entire dataset)
o​ Support: ​ This measurement technique measures how often multiple items are purchased and
compared it to the overall dataset.
(Item A + Item B) / (Entire dataset)
o​ Confidence: ​ This measurement technique measures how often item B is purchased when item A is
purchased as well.
(Item A + Item B)/ (Item A)

5.​ Outlier Analysis : This type of data mining technique relates to the observation of data items in the
data set, which do not match an expected pattern or expected behavior. This technique may be
used in various domains like intrusion, detection, fraud detection, etc. It is also known as Outer
detection or Outilier mining. The outlier is a data point that diverges too much from the rest of the
dataset. The majority of the real-world datasets have an outlier. Outlier detection plays a
significant role in the data mining field. Outlier detection is valuable in numerous fields like
network interruption identification, credit or debit card fraud detection, detecting outlying in
wireless sensor network data, etc.

6.​ Sequential Patterns: The sequential pattern is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It comprises of finding interesting subsequences in
a set of sequences, where the stake of a sequence can be measured in terms of different criteria
like length, occurrence frequency, etc. In other words, this technique of data mining helps to
discover or recognize similar patterns in transaction data over some time.
7.​ Prediction: Prediction used a combination of other data mining techniques such as trends,
clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a
future event.

Mining Frequent patterns


Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a
data [Link] example, a set of items, such as milk and bread, that appear frequently together in a transaction
data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a
memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. A
substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be
combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent)
structured pattern. Finding frequent patterns plays an essential role in mining associations, correlations, and
many other interesting relationships among data. Moreover, it helps in data classification, clustering, and
other data mining tasks. Thus, frequent pattern mining has become an important data mining task and a
focused theme in data mining research. There are several different algorithms used for frequent pattern
mining, including: Apriori algo , FP growth algo.
Advantages:
1.​ It can find useful information which is not visible in simple data browsing
2.​ It can find interesting association and correlation among data items
Disadvantages:
1.​ It can generate a large number of patterns
2.​ With high dimensionality, the number of patterns can be very large, making it difficult to
interpret the results.
Association and Correlation
Association rule: Who buys bread and butter also buys jam
Basic idea: finding two events A and B such that if A happens then also B is likely to happen
​ ​ Notation: A ⇒B
Correlation rule:Who buys tea does not buy coffee

Basic ideas:– association rules do not capture many interesting dependencies among items
– e.g. who buys coffee do not usually buy tea
– the key is that buying coffee and buying tea are not associated but correlated
Correlation analysis explores the association between two or more variables and makes inferences about the
strength of the relationship.

Note: It is common to use the terms correlation and association interchangeably. Technically, association refers to
any relationship between two variables, whereas correlation is often used to refer only to a linear relationship
between two variables.

Association Rule Learning


Association rule learning is a type of unsupervised learning technique that checks for the dependency
of one data item on another data item and maps accordingly so that it can be more profitable. It tries
to find some interesting relations or associations among the variables of dataset.
It is based on different rules to discover the interesting relations between variables in the database.
For example, if a customer buys bread, he most likely can also buy butter, eggs, or milk, so these
products are stored within a shelf or mostly nearby. Association rule learning can be divided into
three types of algorithms:

1.​ Apriori
2.​ Eclat
3.​ F-P Growth Algorithm
How does Association Rule Learning work?
Association rule learning works on the concept of If and Else Statement, such as if A then B. Here the
“If” element is called antecedent, and “then" statement is called as Consequent. These types of
relationships where we can find out some association or relation between two items is known as
single cardinality. It is all about creating rules, and if the number of items increases, then cardinality
also increases accordingly. So, to measure the associations between thousands of data items, there
are several metrics. These metrics are given below: o Support oConfidence oLift

Support :​ Support is the frequency of A or how frequently an item appears in the dataset. It is
defined as the fraction of the transaction T that contains the itemset X. If there are X datasets, then for
transactions T, it can be written as:

Confidence : ​ Confidence indicates how often the rule has been found to be true. Or how often
the items X and Y occur together in the dataset when the occurrence of X is already given. It is the
ratio of the transaction that contains X and Y to the number of records that contain X.

Lift :​ ​ It is the strength of any rule, which can be defined as below formula:

It is the ratio of the observed support measure and expected support if X and Y are independent of
each other. It has three possible values:

o​ If Lift= 1: The probability of occurrence of antecedent and consequent is independent of each other. o
Lift>1: It determines the degree to which the two itemsets are dependent to each other.
o​ Lift<1: It tells us that one item is a substitute for other items, which means one item has a negative effect
on another.

Applications of Association Rule Learning


o​ Market Basket Analysis: It is one of the popular examples and applications of association rule
mining. This technique is commonly used by big retailers to determine the association
between items.
o​ Medical Diagnosis: With the help of association rules, patients can be cured easily, as it helps
in identifying the probability of illness for a particular disease.
o​ Protein Sequence: The association rules help in determining the synthesis of artificial Proteins.
o​ It is also used for the Catalog Design and Loss-leader Analysis and many more other
applications.
Apriori Algorithm
Apriori algorithm refers to the algorithm which is used to calculate the association rules between
objects. It means how two or more objects are related to one another. In other words, we can say
that the apriori algorithm is an association rule leaning that analyzes that people who bought product
A also bought product B. The primary objective of the apriori algorithm is to create the association
rule between different objects. The association rule describes how two or more objects are related to
one another. Apriori algorithm is also called frequent pattern mining. Generally, you operate the
Apriori algorithm on a database that consists of a huge number of transactions. Let's understand the
apriori algorithm with the help of an example; suppose you go to Big Bazar and buy different
products. It helps the customers buy their products with ease and increases the sales performance of
the Big Bazar. In this tutorial, we will discuss the apriori algorithm with examples.

Introduction We take an example to understand the concept better. You must have noticed that the
Pizza shop seller makes a pizza, soft drink, and breadstick combo together. He also offers a discount
to their customers who buy these combos. Do you ever think why does he do so? He thinks that
customers who buy pizza also buy soft drinks and breadsticks. However, by making combos, he
makes it easy for the customers. At the same time, he also increases his sales performance. Similarly,
you go to Big Bazar, and you will find biscuits, chips, and Chocolate bundled together. It shows that
the shopkeeper makes it comfortable for the customers to buy these products in the same place.
What is Apriori Algorithm? Apriori algorithm refers to an algorithm that is used in mining frequent
products sets and relevant association rules. Generally, the apriori algorithm operates on a database
containing a huge number of transactions. For example, the items customers but at a Big Bazar.
Apriori algorithm helps the customers to buy their products with ease and increases the sales
performance of the particular store.

Components of Apriori algorithm


1.​ Support
2.​ Confidence
3.​ Lift

We have already discussed above; you need a huge database containing a large no of transactions.
Suppose you have 4000 customers transactions in a Big Bazar. You have to calculate the Support,
Confidence, and Lift for two products, and you may say Biscuits and Chocolate. This is because
customers frequently buy these two items together. Out of 4000 transactions, 400 contain Biscuits,
whereas 600 contain Chocolate, and these 600 transactions include a 200 that includes Biscuits and
chocolates. Using this data, we will find out the support, confidence, and lift.

Support : Support refers to the default popularity of any product. You find the support as a quotient
of the division of the number of transactions comprising that product by the total number of
transactions.

Hence, we get ​ ​ Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)

= 400/4000 = 10 %
Confidence : Confidence refers to the possibility that the customers bought both biscuits and
chocolates together. So, you need to divide the number of transactions that comprise both biscuits
and chocolates by the total number of transactions to get the confidence.

Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions involving
Biscuits) = 200/400 = 50 %

It means that 50 percent of customers who bought biscuits bought chocolates also.

Lift : Consider the above example; lift refers to the increase in the ratio of the sale of chocolates
when you sell biscuits. The mathematical equations of lift are given below.

Lift = (Confidence (Biscuits - chocolates)/ (Support (Biscuits) D


= 50/10 = 5

It means that the probability of people buying both biscuits and chocolates together is five times
more than that of purchasing the biscuits alone. If the lift value is below one, it requires that the
people are unlikely to buy both the items together. Larger the value, the better is the combination.

How does the Apriori Algorithm work in Data Mining? The Apriori Algorithm makes the
given assumptions
o​ All subsets of a frequent itemset must be frequent.
o​ The subsets of an infrequent item set must be infrequent. o Fix a threshold support level.
How to improve the efficiency of the Apriori Algorithm?
Hash-based itemset counting : In hash-based itemset counting, you need to exclude the k-itemset
whose equivalent hashing bucket count is least than the threshold is an infrequent itemset.

Transaction Reduction In transaction reduction, a transaction not involving any frequent X itemset
becomes not valuable in subsequent scans.

Advantages of Apriori Algorithm


oIt is used to calculate large itemsets.

OSimple to understand and apply.

Disadvantages of Apriori Algorithms


o​ Apriori algorithm is an expensive method to find support since the calculation has to pass through the
whole database.
o​ Sometimes, you need a huge number of candidate rules, so it becomes computationally more
expensive.

FP Growth Algorithm in Data Mining


The FP-Growth Algorithm is an alternative way to find frequent item sets without using candidate
generations, thus improving performance. For so much, it uses a divide-and-conquer strategy. The
core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree),
which retains the item set association information.

This algorithm works as follows:

o​ First, it compresses the input database creating an FP-tree instance to represent frequent items.
o​ After this first step, it divides the compressed database into a set of conditional databases, each
associated with one frequent pattern.
o​ Finally, each such database is mined separately.

Using this strategy, the FP-Growth reduces the search costs by recursively looking for short patterns
and then concatenating them into the long frequent patterns. In large databases, holding the FP tree in
the main memory is impossible. A strategy to cope with this problem is to partition the database into
a set of smaller databases (called projected databases) and then construct an FP-tree from each of
these smaller databases.

FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information
about frequent patterns in a database. Each transaction is read and then mapped onto a path in the
FP-tree. This is done until all transactions have been read. Different transactions with common
subsets allow the tree to remain compact because their paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP tree is
to mine the most frequent pattern. Each node of the FP tree represents an item of the item set. The
root node represents null, while the lower nodes represent the item sets. The associations of the
nodes with the lower nodes, that is, the item sets with the other item sets, are maintained while
forming the tree.
defines the FP-tree as the tree structure given below:

1.​ One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-item-header
table.
2.​ Each node in the item-prefix subtree consists of three fields:
o​ Item-name: registers which item is represented by the node;
o Count:the number of transactions represented by the portion of the path reaching the node;

o Node-link: links to the next node in the FP-tree carrying the same item name or null if there
is none.
3.​ Each entry in the frequent-item-header table consists of two fields:
o​ Item-name: as the same to the node;

o ​ Head of node-link: a pointer to the first node in the FP-tree carrying the item name.

Additionally,the frequent-item-header table can have the count support for an item. The below
diagram is an example of best-case scenario that occurs when all transactions have the same
itemset; the size of the FP-tree will be only a single branch of nodes.

The worst-case scenario occurs when every transaction has a unique item set. So the space needed
to store the tree is greater than the space used to store the original data set because the FP-tree
requires additional space to store pointers between nodes and the counters for each item. The
diagram below shows how a worst-case scenario FP-tree might appear. As you can see, the tree's
complexity grows with each transaction's uniqueness.
Generalized Sequential pattern mining
GSP is a very important algorithm in data mining. It is used in sequence mining from large
databases. Almost all sequence mining algorithms are basically based on a prior algorithm.
GSP uses a level-wise paradigm for finding all the sequence patterns in the data. It starts
with finding the frequent items of size one and then passes that as input to the next
iteration of the GSP algorithm. The database is passed multiple times to this algorithm. In
each iteration, GSP removes all the non-frequent itemsets. This is done based on a
threshold frequency which is called support. Only those itemsets are kept whose frequency
is greater than the support count. After the first pass, GSP finds all the frequent sequences
of length1 which are called 1-sequences. This makes the input to the next pass, it is the
candidate for 2-sequences. At the end of this pass, GSP generates all frequent
2-sequences, which makes the input for candidate 3-sequences. The algorithm is
recursively called until no more frequent itemsets are found.

Basic of Sequential Pattern (GSP) Mining:

•​ Sequence: A sequence is formally defined as the ordered set of items {s1, s2, s3, …,
sn}. As the name suggests, it is the sequence of items occurring together. It can be
considered as a transaction or purchased items together in a basket.
•​ Subsequence: The subset of the sequence is called a subsequence. Suppose {a, b, g,
q, y, e, c} is a sequence. The subsequence of this can be {a, b, c} or {y, e}. Observe that
the subsequence is not necessarily consecutive items of the sequence. From the
sequences of databases, subsequences are found from which the generalized
sequence patterns are found at the end.
•​ Sequence pattern: A sub-sequence is called a pattern when it is found in multiple
sequences. The goal of the GSP algorithm is to mine the sequence patterns from the
large database. The database consists of the sequences. When a subsequence has a
frequency equal to more than the “support” value. For example: the pattern <a, b> is a
sequence pattern mined from sequences {b, x, c, a}, {a, b, q}, and {a, u, b}.

Methods for Sequential Pattern Mining:


Apriori-based Approaches
•​ GSP
•​ SPADE
Pattern-Growth-based Approaches
•​ FreeSpan
•​ PrefixSpan

Applications of sequential pattern mining


–Customer shopping sequences: •First buy computer, then CD-ROM, and then digital camera, within 3
months.
–Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc
.–Telephone calling patterns, Weblog click streams
–DNA sequences and gene structures

Sequence Database: A database that consists of ordered elements or events is called a


sequence database. Example of a sequence database:
[Link]. ​ SID ​ ​ sequences

1.​ 100 ​ <a(ab)(ac)d(cef)>

2.​ 200 ​ <(ad)c(bcd)(abe)>

3.​ 300 ​ <(ef)(ab)(def)cb>

4.​ 400 ​ <eg(adf)CBC>

Transaction: The sequence consists of many elements which are called transactions.
<a(ab)(ac)d(cef)> is a sequence whereas (a), (ab), (ac), (d) and (cef) are the elements
of the sequence. These elements are sometimes referred as transactions. An element
may contain a set of items. Items within an element are unordered and we list them
alphabetically. For example, (cef) is the element and it consists of 3 items c, e and f.
Since, all three items belong to same element, their order does not matter. But we
prefer to put them in alphabetical order for convenience. The order of the elements of
the sequence matters unlike order of items in same transaction.
k-length Sequence:
The number of items involved in the sequence is denoted by K. A sequence of 2 items is
called a 2-len sequence. While finding the 2-length candidate sequence this term comes
into use. Example of 2-length sequence is: {ab}, {(ab)}, {bc} and {(bc)}.
Support in k-length Sequence:
Support means the frequency. The number of occurrences of a given k-length sequence in
the sequence database is known as the support. While finding the support the order is
taken care.
Illustration:
Suppose we have 2 sequences in the database.
s1:<a(bc)b(cd)>​
s2:<b(ab)abc(de)>
We need to find the support of {ab} and {(bc)}
Finding support of {bc}:
Since, b and c are present in same element, their order does not matter.
s1:<a(bc)b(cd)>, first occurrence.
s2: <b(ab)abc(de)>, it seems correct, but is not. b and c are present in different elements
here. So, we don’t consider it. Hence, support of {(bc)} is 1.
How to join L1 and L1 to give C2?
L1 is the final 1-length sequence after pruning. After pruning all the entries left in the set
have supported greater than the threshold.
Case 1: Join {ab} and {ac}
s1:{ab}, s2: {ac}
After removing a from s1 and c from s2. s1’={b},
s2’={a} s1′ and s2′ are not same, so s1 and s2 can’t be
joined. Case 2: Join {ab} and {be} s1: {ab}, s2: {be}
After removing a from s1 and e from s2.
s1’={b},s2’={b} s1′ and s2′ are exactly same, so s1
and s2 be joined.
s1 + s2 = {abe}
note : s1 + s2 = {(ab)e} s1 and s2 are joined in such a way that items belong to correct
elements or transactions.

Pruning Phase: While building Ck (candidate set of k-length), we delete a candidate


sequence that has a contiguous (k-1) subsequence whose support count is less than the
minimum support (threshold). Also, delete a candidate sequence that has any
subsequence without minimum support. {abg} is a candidate sequence of C3.
To check if {abg} is proper candidate or not, without checking its support, we check the
support of its subsets. Because subsets of 3-length sequence will be 1 and 2 length
sequences. We build the candidate sets incremently like 1-length, 2-length and so on.
Subsets of {abg} are: {ab], {bg} and {ag}
Check support of all three subsets. If any of them have support less than minimum
support then delete the sequence {abg} from the set C3 otherwise keep it.

Challenges in Generalized Sequential Pattern Data Mining :The database is passed many
times to the algorithm recursively. The computational efforts are more to mine the frequent
pattern. When the sequence database is very large and patterns to be mined are long then
GSP encounters the problem in doing so effectively.

Clustering in Data Mining


Clustering is an unsupervised Machine Learning-based Algorithm that comprises a group of data
points into clusters so that the objects belong to the same group. Clustering helps to splits data into
several subsets. Each of these subsets contains data similar to each other, and these subsets are
called clusters. Now that the data from our customer base is divided into clusters, we can make an
informed decision about who we think is best suited for this product. Clustering, is a method of data
mining that groups similar data points together. The goal of cluster analysis is to divide a dataset
into groups (or clusters) such that the data points within each group are more similar to each
other than to data points in other groups. This process is often used for exploratory data analysis
and can help identify patterns or relationships within the data that may not be immediately
obvious. There are many different algorithms used for cluster analysis, such as k-means,
hierarchical clustering, and density-based clustering. The choice of algorithm will depend on the
specific requirements of the analysis and the nature of the data being analyzed.
Clustering only utilizes input data, to determine patterns, anomalies, or similarities in its input data. A
good clustering algorithm aims to obtain clusters whose: The intra-cluster similarities are high, It
implies that the data present inside the cluster is similar to one another. Or The inter-cluster similarity
is low, and it means each cluster holds data that is not similar to other data.

What is clustering in Data Mining?


o​ Clustering is the method of converting a group of abstract objects into classes of similar objects.
o​ Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called
clusters.
o​ It helps users to understand the structure or natural grouping in a data set and used either as a
standalone instrument to get a better insight into data distribution or as a pre-processing step for other
algorithms

o​ The over-classification main advantage is that it is adaptable to modifications, and it helps single out
important characteristics that differentiate between distinct groups.

Applications of cluster analysis in data mining:

o​ In many applications, clustering analysis is widely used, such as data analysis, market research, pattern
recognition, and image processing.
o​ It assists marketers to find different groups in their client base and based on the purchasing patterns.
They can characterize their customer groups.
o​ It helps in allocating documents on the internet for data discovery.

o​ Clustering is also used in tracking applications such as detection of credit card fraud. o As a data mining
function, cluster analysis serves as a tool to gain insight into the distribution of data to analyze the
characteristics of each cluster.

o​ In terms of biology, It can be used to determine plant and animal taxonomies, categorization of genes
with the same functionalities and gain insight into structure inherent to populations.
o​ It helps in the identification of areas of similar land that are used in an earth observation database and
the identification of house groups in a city according to house type, value, and geographical location.

Similarity in Data Mining


In Data Mining, similarity measure refers to distance with dimensions representing features of the
data object, in a dataset. If this distance is less, there will be a high degree of similarity, but when
the distance is large, there will be a low degree of similarity.
Some of the popular similarity measures are –
1.​ Euclidean Distance. -

2.​ Manhattan Distance. -

3.​ Jaccard Similarity. -

4.​ Minkowski Distance. -


5.​ Cosine Similarity. -

Typical Requirements Of Clustering In Data Mining:


●​ Scalability
●​ Ability to deal with different types of attributes
●​ Ability to deal with noisy data
●​ Incremental clustering and insensitivity to the order of input records
●​ High dimensionality
●​ Interpretability and usability

Clustering Methods:
●​ Partitioning Method
●​ Hierarchical Method
●​ Density-based Method
●​ Grid-Based Method
Partitioning Method: It is used to make partitions on the data in order to form clusters. If “n”
partitions are done on “p” objects of the database then each partition is represented by a cluster
and n < p. The two conditions which need to be satisfied with this Partitioning Clustering
Method:

●​ One objective should only belong to only one group.


●​ There should be no group without even a single purpose.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects
is created. We can classify hierarchical methods and will be able to know the purpose of
classification on the basis of how the hierarchical decomposition is formed.

Density-Based Method: (DBSCAN) The density-based method mainly focuses on density. In this
method, the given cluster will keep on growing continuously as long as the density in the
neighbourhood exceeds some threshold, i.e, for each data point within a given cluster. The radius
of a given cluster has to contain at least a minimum number of points.
Grid-Based Method: In the Grid-Based method a grid is formed using the object together,i.e, the
object space is quantized into a finite number of cells that form a grid structure. One of the major
advantages of the grid-based method is fast processing time and it is dependent only on the
number of cells in each dimension in the quantized space. The processing time for this method is
much faster so it can save time.
Advantages of Cluster Analysis:
1.​ It can help identify patterns and relationships within a dataset that may not be
immediately obvious.
2.​ It can be used for exploratory data analysis and can help with feature selection.
3.​ It can be used to reduce the dimensionality of the data.
4.​ It can be used for anomaly detection and outlier identification.
5.​ It can be used for market segmentation and customer profiling.
Disadvantages of Cluster Analysis:
1.​ It can be sensitive to the choice of initial conditions and the number of clusters.
2.​ It can be sensitive to the presence of noise or outliers in the data.
3.​ It can be difficult to interpret the results of the analysis if the clusters are not
well-defined.
4.​ It can be computationally expensive for large datasets.
5.​ The results of the analysis can be affected by the choice of clustering algorithm used.
6.​ It is important to note that the success of cluster analysis depends onthe data, the goals
of the analysis, and the ability of the analyst to interpret the results.

Partitioning methods in data mining


Partitioning Method:
This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. Its the data analysts to specify the number of
clusters that has to be generated for the clustering methods.
In the partitioning method when database that contains multiple objects then the
partitioning method constructs user-specified(K) partitions of the data in which each
partition represents a cluster and a particular region. There are many algorithms that come
under partitioning method some of the popular ones are K-Mean, PAM(K-Mediods),
CLARA algorithm (Clustering Large Applications) etc.
K-Mean (A centroid based Technique):
The K means algorithm takes the input parameter K from the user and partitions the
dataset containing N objects into K clusters so that resulting similarity among the data
objects inside the group (intracluster) is high but the similarity of data objects with the data
objects from outside the cluster is low (intercluster). The similarity of the cluster is
determined with respect to the mean value of the cluster.
It is a type of square error algorithm. At the start randomly k objects from the dataset are
chosen in which each of the objects represents a cluster mean(centre). For the rest of the
data objects, they are assigned to the nearest cluster based on their distance from the
cluster mean. The new mean of each of the cluster is then calculated with the added data
objects.
Algorithm: K mean:
1.​ First, we randomly initialize k points, called means or cluster centroids.
2.​ We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
3.​ We repeat the process for a given number of iterations and at the end, we have our
clusters.
K-Medoids clustering(partitioning methods):
A medoid can be defined as a point in the cluster, whose dissimilarities with all the other points in
the cluster are minimum. The dissimilarity of the medoid(Ci) and object(Pi) is calculated by using E
= |Pi – Ci|

Algorithm:

1.​ Initialize: select k random points out of the n data points as the medoids.
2.​ Associate each data point to the closest medoid by using any common distance
metric methods.
3.​ While the cost decreases: For each medoid m, for each data o point which is not a
medoid:-
●​ Swap m and o, associate each data point to the closest medoid, and recompute the cost.
●​ If the total cost is more than that in the previous step, undo the swap.
Advantages:

1.​ It is simple to understand and easy to implement.


2.​ K-Medoid Algorithm is fast and converges in a fixed number of steps.
3.​ PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages:

1.​ The main disadvantage of K-Medoid algorithms is that it is not suitable for clustering
non-spherical (arbitrarily shaped) groups of objects. This is because it relies on
minimizing the distances between the non-medoid objects and the medoid (the cluster
center) – briefly, it uses compactness as clustering criteria instead of connectivity.
2.​ It may obtain different results for different runs on the same dataset because the first k
medoids are chosen randomly.

Hierarchical clustering in data mining


In this method, a hierarchical decomposition of the given set of data objects is created. We can
classify hierarchical methods and will be able to know the purpose of classification on the basis of
how the hierarchical decomposition is formed. Hierarchical clustering refers to an unsupervised
learning procedure that determines successive clusters based on previously defined clusters. It
works via grouping data into a tree of clusters. Hierarchical clustering stats by treating each data
points as an individual cluster. The endpoint refers to a different set of clusters, where each
cluster is different from the other cluster, and the objects within each cluster are the same as one
another.

●​ Agglomerative Approach: The agglomerative approach is also known as the


bottom-up approach. Initially, the given data is divided into which objects form separate
groups. Thereafter it keeps on merging the objects or the groups that are close to one
another which means that they exhibit similar properties. This merging process
continues until the termination condition holds.
●​ Divisive Approach: The divisive approach is also known as the top-down approach. In
this approach, we would start with the data objects that are in the same cluster. The
group of individual clusters is divided into small clusters by continuous iteration. The
iteration continues until the condition of termination is met or until each cluster contains one
object.
Once the group is split or merged then it can never be undone as it is a rigid method and is not so
flexible. The two approaches which can be used to improve the Hierarchical Clustering Quality in
Data Mining are: –

●​ One should carefully analyze the linkages of the object at every partitioning of
hierarchical clustering.
●​ One can use a hierarchical agglomerative algorithm for the integration of hierarchical
agglomeration. In this approach, first, the objects are grouped into micro-clusters. After
grouping data objects into microclusters, macro clustering is performed on the
microcluster.
Agglomerative hierarchical clustering
Agglomerative clustering is one of the most common types of hierarchical clustering used to group
similar objects in clusters. Agglomerative clustering is also known as AGNES (Agglomerative Nesting).
In agglomerative clustering, each data point act as an individual cluster and at each step, data objects
are grouped in a bottom-up method. Initially, each data object is in its cluster. At each iteration, the
clusters are combined with different clusters until one cluster is formed.

Agglomerative hierarchical clustering algorithm

●​ Consider each alphabet as a single cluster and calculate the distance of one cluster
from all the other clusters.
●​ In the second step, comparable clusters are merged together to form a single cluster.
Let’s say cluster (B) and cluster (C) are very similar to each other therefore we merge
them in the second step similarly to cluster (D) and (E) and at last, we get the clusters
[(A), (BC), (DE), (F)]
●​ We recalculate the proximity according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
●​ Repeating the same process; The clusters DEF and BC are comparable and merged
together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
●​ At last, the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
Divisive Hierarchical Clustering
Divisive hierarchical clustering is exactly the opposite of Agglomerative Hierarchical clustering. In
Divisive Hierarchical clustering, all the data points are considered an individual cluster, and in every
iteration, the data points that are not similar are separated from the cluster. The separated data points
are treated as an individual cluster. Finally, we are left with N clusters.
Advantages of Hierarchical clustering
●​ It is simple to implement and gives the best output in some cases.
●​ It is easy and results in a hierarchy, a structure that contains moreinformation.
●​ It does not need us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering


●​ It breaks the large clusters.
●​ It is Difficult to handle different sized clusters and convex shapes.
●​ It is sensitive to noise and outliers.
●​ The algorithm can never be changed or deleted once it was done previously.

Density-based clustering in datamining


Density-based clustering refers to a method that is based on local cluster criterion, such as density
connected points. Density-Based Clustering refers to one of the most popular unsupervised learning
methodologies used in model building and machine learning algorithms. The data points in the region
separated by two clusters of low point density are considered as noise. The surroundings with a
radius ε of a given object are known as the ε neighborhood of the object. If the ε neighborhood of the
object comprises at least a minimum number, MinPts of objects, then it is called a core object.

Major Features of Density-Based Clustering


o​ It is a scan method.
o​ It requires density parameters as a termination condition.
o ​ It is used to manage noise in data clusters.
o​ Density-based clustering is used to identify clusters of arbitrary size.

DBSCAN : DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It depends
on a density-based notion of cluster. It also identifies clusters of arbitrary size in the spatial database
with outliers.
OPTICS : OPTICS stands for Ordering Points To Identify the Clustering Structure. It gives a significant
order of database with respect to its density-based clustering structure. The order of the cluster
comprises information equivalent to the density-based clustering related to a long range of parameter
settings. OPTICS methods are beneficial for both automatic and interactive cluster analysis, including
determining an intrinsic clustering structure.
DENCLUE : Density-based clustering enables a compact mathematical description of arbitrarily
shaped clusters in high dimension state of data, and it is good for data sets with a huge amount of
noise.

Clustering with constraints


Constrained clustering is an approach to clustering the data while it incorporates the domain
knowledge in form of constraints. All data including input data, constraints, and domain knowledge
are processed in the clustering process with constraints and give the output clusters as an output.
There are various methods for clustering with constraints and can handle specific constraints:
•​ Handling Hard Constraints: There is a method for handling the hard constraints by regarding
the constraint in a cluster assignment procedure. It is a very important method for handling the
difficult constraints we can regard the constraints in the assignment procedure of cluster.
•​ Generating the super instances for must-link constraints: There are must-link constraints
that have transitive closure that can be calculated by it. so that we can say that must-link
constraints are known as an equivalence relation. The subset can be defined by it. In subset,
there are some objects which can be replaced by the mean.
•​ Handling the soft constraints: In the clustering process of soft constraints, there is always an
optimization process. There is always a penalty requires in the clustering process. Hence the
optimization in this process’s aim is to optimize the constraint violation and decreasing the
clustering aspect. For example, if we take two sets one is of data sets and the other is a set of
constraints, CVQE stands for Constrained Vector Quantization Error. In the CVQE algorithm,
K-means clustering is enforced to constraint violation penalty. The main objective of CVQE is
the total distance used for K-means which are used as follows:
•​ Penalty in must-link violation: This penalty occurs due to when there is a mustlink
constraint present on objects y, x. They are created to the given two centers C1, C2 by which
the constraint can be violated hence the distance that lies between C1 and C2 is inserted but
as a penalty.
•​ Penalty in cannot-link violation: This type of penalty is different from a must-link
violation as in this penalty there is one center created to a common center C when cannot link
is present on objects x, y. Therefore the constraints are violated and hence the distance that
lies between (C, C) can be inserted in the objective function and it is recognized as a penalty.

Outlier Detection
An outlier is an individual point of data that is distant from other points in the dataset. It is an
anomaly
in the dataset that may be caused by a range of errors in capturing, processing or manipulating
data.
Outliers can skew overall data trends, so outlier detection methods are an important part of
statistics.
Outliers will be a consideration for any area that uses data to make decisions. If an organisation is
gaining insight from data, outliers are a real risk.
The two main types of outlier detection methods are:
Distance and density
This outlier detection technique uses the proximity of data points within two or three dimensions to
identify outliers. This will usually be a contextual or point outlier as explained earlier in this guide.
Different outlier detection models will use different approaches. Distance flags a data point as an
outlier if it is mapped beyond a certain threshold away from other data points. The density
approach groups together data points as clusters, using the distance between each cluster point to
set the boundary of the grouping. Outliers will be data points that exist outside of the cluster,
beyond the user-defined threshold.

Predicting data point distribution


This outlier detection technique uses statistical models to predict the probability of a dataset’s
distribution. The approach can be used to model the probable distribution and highlight data points
as outliers if they do not meet this distribution. All types of outlier can be identified through this
technique, but is particularly useful for identifying collective outliers. Once identified, the
organisation can undergo outlier analysis to understand the underlying issue.

BIRCH method
Clustering algorithms like K-means clustering do not perform clustering very efficiently and it is
difficult to process large datasets with a limited amount of resources (like memory or a slower
CPU). So, regular clustering algorithms do not scale well in terms of running time and quality as
the size of the dataset increases. This is where BIRCH clustering comes in. Balanced Iterative
Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large
datasets by first generating a small and compact summary of the large dataset that retains as
much information as possible. This smaller summary is then clustered instead of clustering the
larger dataset. BIRCH is often used to complement other clustering algorithms by creating a
summary of the dataset that the other clustering algorithm can now use. However, BIRCH has
one major drawback – it can only process metric attributes. A metric attribute is any attribute
whose values can be represented in Euclidean space i.e., no categorical attributes should be
present. Before we implement BIRCH, we must understand two important terms: Clustering
Feature (CF) and CF – Tree Clustering Feature (CF): BIRCH summarizes large datasets into
smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry
is defined as an ordered triple, (N, LS, SS) where ‘N’ is the number of data points in the cluster,
‘LS’ is the linear sum of the data points and ‘SS’ is the squared sum of the data points in the
cluster. It is possible for a CF entry to be composed of other CF entries. CF Tree: The CF tree is the
actual compact representation that we have been speaking of so far. A CF tree is a tree where
each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node
and a CF entry made up of the sum of CF entries in the child nodes. There is a maximum number
of entries in each leaf node. This maximum number is called the threshold. We will learn more
about what this threshold value is. Parameters of BIRCH Algorithm :
●​ threshold : threshold is the maximum number of data points a sub-cluster in the leaf
node of the CF tree can hold.
●​ branching_factor : This parameter specifies the maximum number of CF sub-clusters in
each node (internal node).
●​ n_clusters : The number of clusters to be returned after the entire BIRCH algorithm is
complete i.e., number of clusters after the final clustering step. If set to None, the final
clustering step is not performed and intermediate clusters are returned.

You might also like