0% found this document useful (0 votes)
5 views27 pages

Understanding Cluster Analysis Methods

Cluster analysis is the process of grouping similar data objects into clusters, with applications in market research, biology, and outlier detection. Key requirements for effective clustering include scalability, handling various data types, and the ability to discover clusters of arbitrary shapes. Major clustering methods include partitioning, hierarchical, density-based, grid-based, and model-based approaches, each with specific algorithms like k-means and k-medoids for partitioning.

Uploaded by

ck636563
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views27 pages

Understanding Cluster Analysis Methods

Cluster analysis is the process of grouping similar data objects into clusters, with applications in market research, biology, and outlier detection. Key requirements for effective clustering include scalability, handling various data types, and the ability to discover clusters of arbitrary shapes. Major clustering methods include partitioning, hierarchical, density-based, grid-based, and model-based approaches, each with specific algorithms like k-means and k-medoids for partitioning.

Uploaded by

ck636563
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Cluster Analysis

4.1 Cluster Analysis:


The process of grouping a set of physical or abstract objects into classes of similar objects
is called clustering.
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression.
Cluster analysis tools based on k-means, k-medoids, and several methods have also been
built into many statisticalanalysis software packages or systems, such as S-Plus, SPSS, and
SAS.

4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value,and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection,Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.

4.1.2 Typical Requirements Of Clustering InData Mining:


 Scalability:
Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering
on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density.
However, a cluster could be of any shape. It is important to develop algorithms thatcan
detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis
(such as the number of desired clusters). The clustering results can be quite sensitive to
input parameters. Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but it also makes the
quality of clustering difficult to control.
 Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data.
Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
 Incremental clustering and insensitivity to the order of input records:
Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates)
into existing clustering structures and, instead, must determine a new clustering from
scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different
clusterings depending on the order of presentation of the input objects.
It is important to develop incremental clustering algorithms and algorithms thatare
insensitive to the order of input.
 High dimensionality:
A database or a data warehouse can contain several dimensionsor [Link]
clustering algorithms are good at handling low-dimensional data,involving only two to
three dimensions. Human eyes are good at judging the qualityof clustering for up to three
dimensions. Finding clusters of data objects in highdimensionalspace is challenging,
especially considering that such data can be sparseand highly skewed.
 Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households
while considering constraints such as the city’s rivers and highway networks, and the type
and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
 Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering
features and methods.

4.2 Major Clustering Methods:


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

4.2.1 Partitioning Methods:


A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation


technique that attempts to improve the partitioning by moving objects from one group to
another.

The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.

4.2.2 Hierarchical Methods:


A hierarchical method creates a hierarchical decomposition ofthe given set of data objects. A
hierarchical method can be classified as being eitheragglomerative or divisive, based on
howthe hierarchical decomposition is formed.

 Theagglomerative approach, also called the bottom-up approach, starts with each
objectforming a separate group. It successively merges the objects or groups that are
closeto one another, until all of the groups are merged into one or until a termination
condition holds.
 The divisive approach, also calledthe top-down approach, starts with all of the objects in
the same cluster. In each successiveiteration, a cluster is split up into smaller clusters,
until eventually each objectis in one cluster, or until a termination condition holds.
Hierarchical methods suffer fromthe fact that once a step (merge or split) is done,it can never
be undone. This rigidity is useful in that it leads to smaller computationcosts by not having
toworry about a combinatorial number of different choices.

There are two approachesto improving the quality of hierarchical clustering:

 Perform careful analysis ofobject ―linkages‖ at each hierarchical partitioning, such as in


Chameleon, or
 Integratehierarchical agglomeration and other approaches by first using a
hierarchicalagglomerative algorithm to group objects into microclusters, and then
performingmacroclustering on the microclusters using another clustering method such as
iterative relocation.
4.2.3 Density-based methods:
 Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers)and discover clusters of
arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that
growclusters according to a density-based connectivity analysis. DENCLUE is a
methodthat clusters objects based on the analysis of the value distributions of density
functions.
4.2.4 Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.
 All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. Wave Cluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

4.2.5 Model-Based Methods:


 Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking ―noise‖ or outliers into account and thus yielding robust
clustering methods.

4.3 Tasks in Data Mining:


 Clustering High-Dimensional Data
 Constraint-Based Clustering
4.3.1 Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications
require the analysis of objects containing a large number of features or dimensions.
For example, text documents may contain thousands of terms or keywords as
features, and DNA micro array data may provide information on the expression
levels of thousands of genes under hundreds of conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality.
Many dimensions may not be relevant. As the number of dimensions increases,
thedata become increasingly sparse so that the distance measurement between pairs
ofpoints become meaningless and the average density of points anywhere in the
data islikely to be low. Therefore, a different clustering methodology needs to be
developedfor high-dimensional data.
CLIQUE and PROCLUS are two influential subspace clustering methods, which
search for clusters in subspaces ofthe data, rather than over the entire data space.
Frequent pattern–based clustering,another clustering methodology, extractsdistinct
frequent patterns among subsets ofdimensions that occur frequently. It uses such
patterns to group objects and generatemeaningful clusters.

4.3.2 Constraint-Based Clustering:


It is a clustering approach that performs clustering by incorporation of user-specified
or application-oriented constraints.
A constraint expresses a user’s expectation or describes properties of the desired
clustering results, and provides an effective means for communicating with the
clustering process.
Various kinds of constraints can be specified, either by a user or as per application
requirements.
Spatial clustering employs with the existence of obstacles and clustering under user-
specified constraints. In addition, semi-supervised clusteringemploys forpairwise
constraints in order to improvethe quality of the resulting clustering.

4.4 Classical Partitioning Methods:


The mostwell-known and commonly used partitioningmethods are
 The k-Means Method
 k-Medoids Method
4.4.1 Centroid-Based Technique: The K-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects intok
clusters so that the resulting intracluster similarity is high but the intercluster similarity is
low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which
can be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

whereE is the sum of the square error for all objects in the data set
pis the point in space representing a given object
miis the mean of cluster Ci

4.4.1 The k-means partitioning algorithm:


The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.
Clustering of a set of objects based on the k-means method

4.4.2 The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
Thepartitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as

whereE is the sum of the absolute error for all objects in the data set

pis the point inspace representing a given object in clusterCj

ojis the representative object of Cj


The initial representative objects are chosen arbitrarily. The iterative process of replacing
representative objects by non representative objects continues as long as the quality of the
resulting clustering is improved.
This quality is estimated using a cost function that measures the average
dissimilaritybetween an object and the representative object of its cluster.
To determine whether a non representative object, oj random, is a good replacement for a
current representativeobject, oj, the following four cases are examined for each of the
nonrepresentative objects.

Case 1:

pcurrently belongs to representative object, oj. If ojis replaced by orandomasa representative object
and p is closest to one of the other representative objects, oi,i≠j, then p is reassigned to oi.

Case 2:

pcurrently belongs to representative object, oj. If ojis replaced by orandomasa representative object
and p is closest to orandom, then p is reassigned to orandom.

Case 3:

pcurrently belongs to representative object, oi, i≠j. If ojis replaced by orandomas a representative
object and p is still closest to oi, then the assignment does notchange.

Case 4:

pcurrently belongs to representative object, oi, i≠j. If ojis replaced byorandomas a representative
object and p is closest to orandom, then p is reassigned
toorandom.
Four cases of the cost function for k-medoids clustering

4.4.2 Thek-MedoidsAlgorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.


The k-medoids method ismore robust than k-means in the presence of noise and outliers,
because a medoid is lessinfluenced by outliers or other extreme values than a mean. However,
its processing ismore costly than the k-means method.

4.5 Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clusteringmethod suffers fromits inability to
performadjustment once amerge or split decision hasbeen executed. That is, if a particular
merge or split decision later turns out to have been apoor choice, the method cannot
backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive,


depending on whether the hierarchical decomposition is formed in a bottom-up or top-down
fashion.

4.5.1 Agglomerative hierarchical clustering:


This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are satisfied.
Most hierarchical clustering methods belong to this category. They differ only in their
definition of intercluster similarity.

4.5.2 Divisive hierarchical clustering:


This top-down strategy does the reverse of agglomerativehierarchical clustering by
starting with all objects in one cluster.
It subdividesthe cluster into smaller and smaller pieces, until each object forms a cluster
on itsown or until it satisfies certain termination conditions, such as a desired number
ofclusters is obtained or the diameter of each cluster is within a certain threshold.
Why is Choosing the Right Number of Clusters Important?
The quality and interpretability of your clustering results are strongly impacted by the number of clusters you choose,
therefore choosing the right number is essen al. You run the danger of oversimplifying the underlying structure of
your data if you select too few clusters, which could lead to the loss of significant pa erns or groups. On the other
hand, choosing an excessive number of clusters might result in overfi ng, when the model iden fies pa erns that are
difficult to extrapolate to fresh data.
Methods for Determining the Number of Clusters
1. The Elbow Method:
 The elbow method is one of the most commonly used techniques for determining the number of clusters.
 It involves running the clustering algorithm with different numbers of clusters and calcula ng the within-cluster
sum of squares (WCSS) for each number.
 Mathema cally, WCSS is computed as follows:

where k is the number of clusters, nᵢ is the number of data points in cluster i, xⱼ is a data point in cluster i, and μᵢ is the
centroid of cluster i.
 The idea is to look for the “elbow point” on the graph, where the rate of decrease in WCSS starts to slow down.
 The number of clusters at the elbow point is o en considered a reasonable choice.
2. The Silhoue e Score:
 The silhoue e score measures how similar an object is to its own cluster compared to other clusters.
 Mathema cally, the silhoue e score for a data point x is computed as:

 where a(x) is the average distance from x to the other data points in the same cluster, and b(x) is the minimum
average distance from x to data points in a different cluster.
 A higher silhoue e score indicates that the object is well-matched to its cluster.
3. Gap Sta s cs:
 Gap sta s cs compare the performance of your clustering algorithm on the actual data to its performance on
random data (generated under the assump on of no meaningful clusters).
 Mathema cally, gap sta s cs involve comparing the observed within-cluster sum of squares to the expected
within-cluster sum of squares under a null hypothesis.
4. Dendrogram Analysis:
 A dendrogram is a tree-like diagram that shows the hierarchical rela onships between clusters. It is created by
a clustering algorithm, which starts by trea ng each data point as a separate cluster. The algorithm then
repeatedly merges the two closest clusters un l there is only one cluster le . The dendrogram shows the order
in which the clusters were merged.
 The branches of a dendrogram represent the distances between clusters. The closer the two branches are, the
more similar the two clusters are. The point where the branches start to merge less rapidly is the point where
the clusters are becoming more dis nct.
 The number of clusters at this point can be a reasonable es mate of the number of clusters in the data.
However, it is important to note that this is just an es mate, and the actual number of clusters may vary
depending on the clustering algorithm and the data set.
5. Predic on Strength and Co-Membership Matrix:
 Predic on Strength (PS) assesses the stability of the clusters by repeatedly subsampling the data and clustering
subsets of it.
 It measures how o en data points that belong to the same cluster in the original data are s ll clustered
together in the subsamples.
 A higher Predic on Strength indicates more stable and robust clusters.
What is Classifica on?
Classifica on is a data mining technique that uses a set of training data to determine the class or category of a new
observa on. This method of supervised learning uses sta s cal and machine learning techniques to create a model
that can categorise fresh data in accordance with the pa erns seen in the training data.
 A dataset is split into a training set and a test set for classifica on. The classifica on model is constructed using
the training set, and its effec veness is assessed using the test set.
 The classifica on algorithm gains exper se from the training data and applies it to forecast the class of
incoming, untainted data.
 Many applica ons, including image recogni on, spam filtering, fraud detec on, and medical diagnosis, heavily
rely on classifica on.
 Decision trees, k?nearest neighbours, support vector machines, and neural networks are some common
categoriza on algorithms.
Classifica on can be either "binary classifica on" or "mul nomial classifica on".
 When there are exactly two target classes, then it is known as binary classifica on.
 When there are more than two target classes, as in the case of pa ern recogni on issues, then it is known
as mul nomial classifica on.
Advantages of Applying Classifica on in Data Mining
Following are the advantages of applying Classifica on in Data Mining:
 Predic ve power: In order to forecast the class or category of new data, classifica on can help find pa erns in
data that can be u lised for predic on and decision?making.
 Interpretable results: As many classifica on algorithms provide models that are simple to understand, it is
simpler for people to comprehend the logic behind a given classifica on.
 Scalability: Classifica on is a scalable data mining technique since it can be used on big datasets.
 Versa lity: Classifica on is flexible and broadly applicable since it can be applied to many different forms of
data, including numerical and categorical data.
Disadvantages of Applying Classifica on in Data Mining
Following are the disadvantages of applying Classifica on in Data Mining:
 Overfi ng: When a classifica on model fits training data too closely, it is said to be overfit, which leads to
subpar performance on new, untried data.
 Bias: Classifica on models may favour some classes or traits over others, which could lead to incorrect
predic ons.
 Data quality: Inaccurate or inadequate data can lead to incorrect predic ons, which can have an impact on
how accurate the categoriza on model is.
 Complexity: Certain categoriza on algorithms can be quite difficult to develop and interpret because they
need a lot of computer power.
 Sensi vity to input data: Classifica on models are some mes suscep ble to changes in the input data, which
can have a major impact on the projected classes.
Segmenta on in big data refers to dividing large datasets into meaningful groups or
segments based on specific characteris cs. This helps in targeted analy cs, improved
decision-making, and efficient data processing. Segmenta on is widely used in areas like
marke ng, healthcare, and fraud detec on.
1. Types of Segmenta on in Big Data
Big data can be segmented based on different criteria:
A. Demographic Segmenta on
Used in customer analy cs, this involves dividing data based on:
- **Age Groups**: Analysing behaviour of different age demographics.
- **Geographical Loca on**: Sor ng data based on country, city, or region.
- **Income Levels**: Understanding consumer spending pa erns.
B. Behavioural Segmenta on
Divides data based on how users interact with systems:
Purchase History: Categorizing customers based on previous transac ons.
Browsing Pa erns: Iden fying frequent website visits and interests.
Usage Frequency: Sor ng users into heavy, moderate, and light usage groups.
C. Temporal Segmenta on
Involves dividing data based on me-dependent behaviours:
Seasonal Trends: Understanding customer behaviour across different seasons.
Event-based Analysis: Studying reac ons before and a er a par cular event.
D. Spa al Segmenta on
Uses loca on-based analy cs:
Heatmaps for Customer Density: Helps retailers analyse foot traffic.
Urban vs. Rural Behaviour: Different purchase pa erns in ci es versus villages.
2. Methods of Segmenta on in Big Data
There are different techniques used to segment big data:
A. Clustering-Based Segmenta on
K-Means Clustering: Groups similar data points into clusters.
Hierarchical Clustering: Builds tree-like segments based on similari es.
DBSCAN (Density-Based Spa al Clustering of Applica ons with Noise): Iden fies clusters
based on density.
B. Machine Learning-Based Segmenta on
Supervised Learning: Uses labelled data to categorize en es.
Unsupervised Learning: Discovers natural groupings in unlabelled data.
C. Rule-Based Segmenta on
- Uses predefined business rules to classify data, such as segmen ng customers based on
purchase frequency.
D. Graph-Based Segmenta on
Community Detec on Algorithms: Finds influen al groups in networks (e.g., social media
analysis).
Link Analysis: Iden fies rela onships among nodes in large datasets.
3. Applica ons of Segmenta on in Big Data
Segmenta on helps businesses make data-driven decisions:
Retail: Personalized marke ng based on customer preferences.
Healthcare: Pa ent risk profiling for targeted treatments.
Fraud Detec on: Iden fying suspicious pa erns in financial transac ons.
Social Media Analy cs: Tracking sen ment trends within user groups.
What is linear regression?
Linear regression is a data analysis technique that predicts the value of unknown data by
using another related and known data value. It mathema cally models the unknown or
dependent variable and the known or independent variable as a linear equa on. For instance,
suppose that you have data about your expenses and income for last year. Linear regression
techniques analyse this data and determine that your expenses are half your income. They
then calculate an unknown future expense by halving a future known income.
How does linear regression work?
At its core, a simple linear regression technique a empts to plot a line graph between two
data variables, x and y. As the independent variable, x is plo ed along the horizontal axis.
Independent variables are also called explanatory variables or predictor variables. The
dependent variable, y, is plo ed on the ver cal axis. You can also refer to y values as response
variables or predicted variables.
Steps in linear regression
For this overview, consider the simplest form of the line graph equa on between y and x;
y=c*x+m, where c and m are constant for all possible values of x and y. So, for example,
suppose that the input dataset for (x,y) was (1,5), (2,8), and (3,11). To iden fy the linear
regression method, you would take the following steps:
1. Plot a straight line, and measure the correla on between 1 and 5.
2. Keep changing the direc on of the straight line for new values (2,8) and (3,11) un l all
values fit.
3. Iden fy the linear regression equa on as y=3*x+2.
4. Extrapolate or predict that y is 14 when x is
What are the types of linear regression?
Some types of regression analysis are more suited to handle complex datasets than others.
The following are some examples.
Simple linear regression
Simple linear regression is defined by the linear func on:
Y= β0*X + β1 + ε
β0 and β1 are two unknown constants represen ng the regression slope, whereas ε (epsilon)
is the error term.
You can use simple linear regression to model the rela onship between two variables, such
as these:
 Rainfall and crop yield
 Age and height in children
 Temperature and expansion of the metal mercury in a thermometer
Mul ple linear regression
In mul ple linear regression analysis, the dataset contains one dependent variable and
mul ple independent variables. The linear regression line func on changes to include more
factors as follows:
Y= β0*X0 + β1X1 + β2X2+…… βnXn+ ε
As the number of predictor variables increases, the β constants also increase correspondingly.
Mul ple linear regression models mul ple variables and their impact on an outcome:
 Rainfall, temperature, and fer lizer use on crop yield
 Diet and exercise on heart disease
 Wage growth and infla on on home loan rates

Logis c regression
Data scien sts use logis c regression to measure the probability of an event occurring. The
predic on is a value between 0 and 1, where 0 indicates an event that is unlikely to happen,
and 1 indicates a maximum likelihood that it will happen. Logis c equa ons use logarithmic
func ons to compute the regression line.
These are some examples:
 The probability of a win or loss in a spor ng match
 The probability of passing or failing a test
 The probability of an image being a fruit or an animal

Machine Learning (ML) Search in Big Data


ML-based search in Big Data refers to using machine learning techniques to enhance,
automate, and op mize search opera ons in massive datasets. Tradi onal keyword-based
search is o en insufficient when dealing with the volume, velocity, and variety of Big Data.
ML brings intelligence and relevance to search.

What is ML Search?
It is the use of machine learning algorithms to:
 Improve search relevance
 Understand user intent
 Personalize search results
 Predict queries or autocomplete
 Cluster and classify search results

Applica ons of ML Search in Big Data


Applica on Descrip on
Enterprise Search Extrac ng relevant documents from large corpora
E-commerce Search Ranking products based on relevance, clicks
Medical/Legal Search Understanding complex queries with domain-specific language
Social Media Monitoring Searching trends, keywords, user sen ment
Scien fic Research Finding pa erns or publica ons from huge academic datasets
ML Techniques Used in Big Data Search
Technique Purpose
Natural Language Processing (NLP) Understand text-based queries
Clustering (e.g., K-Means) Group similar documents or results
Classifica on Classify documents or content into categories
Ranking Algorithms (e.g., Learning to
Order search results based on relevance
Rank)
Recommenda on Systems Suggest related content or next ac ons
Convert queries and documents into vectors for
Embeddings (e.g., Word2Vec, BERT)
similarity search

Example Use Case


E-commerce Search:
 Input query: "Running shoes for flat feet"
 ML:
o Interprets intent (running + ortho cs need)
o Classifies product types
o Ranks products based on past click and purchase behavior
o Personalizes results based on user profile

Tools for ML Search in Big Data


Tool Func onality
Apache Solr + ML plugins Scalable text search engine with ML support
Elas csearch + Learning to Rank Powerful full-text search with ML models
Amazon Kendra Intelligent enterprise search using NLP
Google Cloud Search ML-powered enterprise search
Facebook FAISS Fast vector similarity search using embeddings
Annoy (Spo fy) Approximate nearest neighbor search

Benefits of ML-Enhanced Search


 Improved Accuracy – Be er matching of user intent
 Smarter Autocomplete – Predicts queries before typing is done
 Contextual Search – Understands domain and context
 Content Discovery – Finds related, relevant info automa cally
 Adap ve Learning – Improves from user feedback and clicks

Challenges
 Scalability – Must handle billions of documents and real- me search
 Training Data – Needs large, labeled datasets
 Latency – Real- me or near-real- me performance required
 Bias and Fairness – Ensure fair and unbiased results
 Privacy – Protect user data during search personaliza on
What is Indexing in DBMS?
 Indexing is used to quickly retrieve particular data from the database.
Formally we can define Indexing as a technique that uses data structures to
optimize the searching time of a database query in DBMS. Indexing reduces
the number of disks required to access a particular data by internally creating
an index table.
 Indexing is achieved by creating Index-table or Index.
 Index usually consists of two columns which are a key-value pair. The two
columns of the index table(i.e., the key-value pair) contain copies of selected
columns of the tabular data of the database.
Types of Indexes in DBMS
The classifica on of indexing in databases is dependent on the indexing a ributes. The two
primary types of indexing methods are known as
1. Primary Indexing.
2. Secondary Indexing.

Types of indexing
Primary Indexing
A Primary Index is a fixed-length, ordered file that contains two fields: the first field is
the primary key, and the second field points to the corresponding data block. In the Primary
Index, each entry in the index table has a one-to-one rela onship with the data block.
Primary indexing is further divided into two spaces:
Dense Indexing
1. It creates a index record for every record of the table with one field equal to the
index/search key value field and the other field points to the pointer to the data record to the
actual record on the disk.
2. It takes a lot of extra space as it makes single entry for every record of the table.
Sparse Index
[Link] this method of indexing technique, rather than having entry for each record of the actual
data and mapping it to a index field, what we do is we have only few entries in the index table
and we use range based indexing, in which the index column is sorted on the base of range.
2. It takes away the problem of dense indexing of having more space.
3. It requires less maintenance overhead.
4. The only con of using this indexing technique is that it takes a li le bit me in loca ng the
actual record. this trade-off of space and me is completely upon the solu on architect of
the team.

Secondary Index/Non-clustered Index


Secondary Index can be created based on a field that contains unique values for each record,
and this field should be a candidate key.
The Secondary Index is also referred to as a non-clustering index.
This indexing technique involves two levels and is used to reduce the mapping size of the first
level. For the first level, a large range of numbers is selected to keep the mapping size small.

secondary indexing
Internal/Under the hood
Bucket of pointers
Bucket of pointers is a bucket/list of pointers poin ng to actual tuples/records of the table.
So now we have,
1. First Level: Dense Index where each value is stored once
2. Bucket list where we have mul ple occurrences
 Pointers to actual values
 Should be sequen al: i.e. ordered by the key.
3. Bucket pointers basically helps us to save a lot of space.
Clustering Index
In a clustered index, records themselves are stored in the Index and not pointers.
Indexes are created on non-primary key columns which might not be unique for each record.
In such a situa on, you can group two or more columns to get the unique values and create
an index which is called Clustered Index.
This also helps you to iden fy the record faster.
Example:
Lets assume in a school, all the students belonging to a same class can be clustered together
under one index.
And the index pointer will be the class_no and it points to the cluster as a whole.
Mul -Level Indexing
Before going in-depth , lets understand what is the need of having mul -level indexing and
problems associated with sequen al searching.
Sequen al Searching
Sequen al Searching, is a method of searching data in a linear manner, by going through each
record one by one un l the required data is found. This can be a me-consuming process for
large datasets, as each record must be searched in sequence un l the desired result is found.
Problem with Sequen al Searching
Lets try to understand the sequen al searching problem with the help of an example->
Consider a bank with 100 million customers. They all have their own account number, which
is an 8-digit number.
Some me, we querying the table containing the records of the customers, when the
customer wants to know about his/her account.
Imagine we didn’t have an index, just one big sequen al file. To find a par cular account, you
would go to the beginning of the file, and search one record a er another un l you either
find the record or come to the end of the file. This is a simple system but with so many records
it would take a long me. You could end up checking 100 million records before you found
the one you wanted!
Mul -level Indexing: Overview
Mul -level indexing is a technique used in databases to speed up the search process by
dividing the index into mul ple levels. This allows for faster access to the data, as the search
process only needs to traverse a smaller por on of the index.
Mul level Indexing in Database is created when a primary index does not fit in memory.
We create a sparse indexing on the sequen al file containing the records.
This sparse indexed file is divided into levels/por ons for be er searching.
Example->
Mul -level(3-level) indexing on account Number
Lets say in the above example of 100 million records entries, we are trying to find the account
number 45843521.
We can use the power of mul -level indexing to search for this record in a much lesser me
frame.
We will create index on the account number and divide the index on range of digits.
First level index -> Range from 0–9.
Second level index-> Range from 00–99.
Third Level index -> Range from 00–99 and so on.
In each level index , we do sequen al searching for the account number, as soon as we find a
hit , we move forward with the next level , un l or unless we reach the exact match of the
account number we are looking forward to.
“This combina on of mul -level indexing and sequen al searching will help in op miza on
of the query and search results in less me”

Inverted Indexing Using JAQL in Big Data


Inverted indexing is a technique used in big data processing to efficiently retrieve documents
containing specific keywords. It is widely used in search engines, text analysis, and large-scale
data retrieval systems. JAQL (JSON Query Language), a declara ve language designed for
querying JSON data in big data environments, can be used to construct inverted indexes
efficiently.
1. What Is an Inverted Index?
An inverted index maps keywords to the loca ons where they appear in a dataset. Unlike
tradi onal indexes that map documents to keywords, inverted indexing reverses the
mapping.
2. Why Use JAQL for Inverted Indexing in Big Data?
JAQL is op mized for handling large-scale JSON-based datasets, commonly found in big data
environments. JAQL’s flexible querying capabili es make it suitable for:
 Processing semi-structured data such as logs, tweets, and sensor readings.
 Indexing large-scale documents efficiently for search opera ons.
 Suppor ng Hadoop integra on for scalable big data processing.
Data Explorer Bundling Hadoop Job Applica on & Diagnos cs –
Detailed Explana on
Data Explorer provides a way to analyze, debug, and op mize Hadoop job applica ons by
bundling execu on metadata, logs, and performance diagnos cs. This bundling simplifies
troubleshoo ng and efficiency tracking for large-scale distributed jobs in Hadoop ecosystems.
1. Understanding Hadoop Job Bundling in Data Explorer
When execu ng Hadoop applica ons, a vast amount of metadata, logs, execu on graphs, and
resource consump on data is generated. Data Explorer bundles these details, making it
easier to:
 Analyze job behavior.
 Detect performance bo lenecks.
 Improve resource u liza on.
 Debug failures efficiently.
This bundling typically includes:
 Job Configura on: Input parameters, file paths, resource alloca on.
 Execu on Logs: Mapper/reducer logs, run me errors, warnings.
 Performance Metrics: CPU, memory, disk usage, network traffic.
 Cluster Diagnos cs: Node failures, applica on restarts, execu on me.
 Data Flow Informa on: DFDs, dependency graphs for execu on tracking.
2. How Bundling Works in Hadoop Jobs
Step 1: Capturing Execu on Metadata
Data Explorer collects job-specific metadata, including:
 Job name, execu on me, and associated data files.
 Input/output dataset details and processing steps.
Step 2: Aggrega ng Logs
All logs (driver, mapper, reducer, executor) are bundled for easy access:
 Mapper Logs: Track data par oning and transforma on stages.
 Reducer Logs: Debug aggrega on, merging, and key-value opera ons.
 Driver Logs: Overall job execu on status.
Step 3: Performance Diagnos cs
Data Explorer gathers key resource sta s cs:
 CPU Usage: Helps analyze computa on overhead.
 Memory Consump on: Detects excessive heap alloca ons.
 Disk I/O: Determines efficiency in read/write opera ons.
 Network Traffic: Iden fies data transfer bo lenecks.
Step 4: Genera ng Dependency Graphs & Execu on Flowcharts
 Dependency Mapping: Traces rela onships between jobs and datasets.
 Flowchart Visualiza on: Displays job execu on sequence.
Step 5: Packaging the Data for Review
All diagnos c details are bundled into a structured archive that includes:
 JSON/XML reports summarizing execu on.
 Log files for debugging.
 Resource consump on charts for performance analysis.
3. Importance of Data Explorer in Hadoop Job Diagnos cs
Using Data Explorer in Hadoop job diagnos cs is essen al for:
 Failure Analysis: Detects node crashes, resource starva on.
 Performance Op miza on: Iden fies inefficient code or cluster bo lenecks.
 Debugging Workflow Issues: Traces incorrect dependencies in job execu on.
 Resource Alloca on Improvements: Ensures op mal memory, CPU, and disk usage.
4. Prac cal Example
Consider a Hadoop MapReduce job that processes large-scale log data:
 Issue: The job takes longer than expected.
 Data Explorer Diagnos c Steps:
1. Iden fy high CPU/memory consump on in specific reducers.
2. Detect uneven data distribu on across mappers.
3. Check for network latency issues in shuffle opera ons.
4. Suggest tuning job parameters (e.g., increasing par ons, modifying heap sizes).
5. Key Benefits for Big Data Engineers
Since you have exper se in big data frameworks and indexing methods, using Data Explorer
can enhance Hadoop job performance by:
 Streamlining debugging with structured logs.
 Improving scalability by op mizing resource alloca on.
 Enhancing distributed compu ng efficiency by reducing bo lenecks.

Part 2: Diagnos cs in Hadoop Jobs (with Data Explorer or Similar Tools)

Why Diagnos cs?


Running jobs on Hadoop may fail or perform poorly. Diagnos cs help you:
 Find why a job failed
 Understand performance bo lenecks
 Op mize resource usage (CPU, memory, disk I/O)

Common Diagnos c Metrics to Monitor


Metric What It Tells You
Job Status Success, failure, in progress
Task Comple on Time Time taken by map/reduce tasks
Shuffle & Sort Metrics Amount of intermediate data exchanged
Resource Usage Memory, CPU, disk I/O per task
YARN logs Detailed logs for containers and errors
Counters Custom metrics like rows processed, bytes read

How Diagnos cs Work in Tools like Data Explorer


1. Job History Retrieval
Pulls metadata from the ResourceManager, YARN Timeline Server, or
JobHistoryServer.
2. Log Aggrega on
Accesses logs from HDFS or a log server. Helps view:
o Task a empts
o Stack traces on errors
o Resource alloca on problems
3. Visualiza on
Tools like:
o Data Explorer
o Apache Ambari
o Cloudera Manager
o Apache Hadoop Web UI
provide charts or dashboards showing:
o Heatmaps for slow tasks
o Task retries
o Skewed data

Common Job Failure Causes and Diagnos cs


Issue Diagnos c Clue Resolu on
Out of Memory Logs show OOM or killed container Increase container memory
Repar on or redesign key
Data Skew Some reducers take much longer
distribu on
Missing Files FileNotFoundExcep on in logs Check input HDFS paths
Dependency ClassNotFound or
Include missing JARs in the lib folder
Issues NoClassDefFoundError
Slow Long task mes or failed specula ve Tune the number of
Performance execu on mappers/reducers or IO formats

Decision Tree: Detailed Explana on


A decision tree is a supervised machine learning algorithm used for classifica on and
regression. It mimics human decision-making by following a tree-like structure, where data
splits into branches based on condi ons or features.
1. Structure of a Decision Tree
A decision tree consists of:
 Root Node: The star ng point represen ng the en re dataset.
 Internal Nodes: Points where the dataset is split based on a feature.
 Branches: Pathways between nodes, represen ng a choice or condi on.
 Leaf Nodes: Final nodes where classifica on or regression decisions are made.
Each split in the tree is based on selec ng the best feature, which helps separate data
op mally.
2. Decision Tree Spli ng Criteria
The best split is determined using various algorithms:
A. Classifica on Decision Trees (CART Algorithm)
Used for categorical target variables.
 Gini Impurity: Measures how o en a randomly chosen element is incorrectly classified.

 Entropy (Informa on Gain): Measures the impurity reduc on from a split.

B. Regression Decision Trees


Used for numerical target variables.
 Mean Squared Error (MSE): Evaluates how well a split minimizes errors.

 Variance Reduc on: Splits based on minimizing variance within each group.
3. Advantages of Decision Trees
 Easy to Interpret: Mimics human decision-making.
 No Feature Scaling Needed: Works well with raw data.
 Handles Categorical & Numerical Data: Can process mixed-type data efficiently.
 Robust to Missing Values: Can handle missing a ributes without requiring imputa on.
4. Drawbacks & Solu ons
 Overfi ng: Trees may grow too complex, leading to poor generaliza on. Solu on: Use
pruning techniques (such as post-pruning or pre-pruning).
 Bias in Imbalanced Data: If one class dominates, it affects accuracy. Solu on: Use
balanced datasets or cost-sensi ve learning.
5. Prac cal Applica ons
 Medical Diagnosis: Predic ng diseases based on pa ent symptoms.
 Fraud Detec on: Iden fying fraudulent transac ons in banking.
 Sen ment Analysis: Classifying posi ve vs. nega ve customer reviews.
 Spam Detec on: Filtering spam emails based on keywords and metadata.

Evaluating a Decision Tree


Evaluating a decision tree model involves assessing its accuracy, efficiency, and
ability to generalize to unseen data. Several metrics and techniques help
determine its performance, ensuring it is neither overfitting nor underfitting.
1. Key Evaluation Metrics for Decision Trees
Decision trees can be evaluated using different criteria based on whether they are
used for classification or regression.
A. Classification Decision Trees
For models predicting categories, we use:
 Accuracy: Measures the proportion of correctly classified instances.

 Precision & Recall:


o Precision: Measures how many of the predicted positive cases were
actually positive.

 Recall: Measures how many of the actual positive cases were correctly
identified.

 F1-Score: Balances precision and recall.

 Confusion Matrix: Shows how predictions are distributed across true and
predicted labels.
B. Regression Decision Trees
For models predicting numerical values, we use:
 Mean Squared Error (MSE): Measures the average squared difference
between actual and predicted values.

 Mean Absolute Error (MAE): Measures the absolute difference between


actual and predicted values.

 R² Score (Coefficient of Determination): Measures how well the model


explains variability.

You might also like