CLUSTERING
LECTURE OBJECTIVES
• Understand clustering
• Apply clustering to topic modeling
• Demonstrate real-life applications of topic modeling
LECTURE OVERVIEW
• Definition of text clustering
• Types of text clustering
• Application of text clustering in topic modeling
• Application
TEXT CLUSTERING
CLUSTERING
• Groups similar documents or items together
• Finds similarity and relationship patterns
• Based on semantics and content
• Needs no labelled dataset
• A form of unsupervised learning
• An efficient way to create categories
• Helps analyze large amounts of data and information
LEVELS OF CLUSTERING
• Levels of text clustering
• Document clustering
• Sentence clustering
• Word-level clustering: words grouped by common theme or meaning
TYPES OF CLUSTERING
• Flat clustering
• Hierarchical clustering (agglomerative and divisive)
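A minimal sketch of agglomerative (bottom-up) hierarchical clustering, assuming one-dimensional points and single linkage; the function name `single_linkage` and the sample values are illustrative only. Stopping the merging at a chosen number of clusters yields a flat partition.

```python
def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical one-dimensional data.
scores = [1.0, 1.2, 5.0, 5.1, 9.0]
three_clusters = single_linkage(scores, 3)
two_clusters = single_linkage(scores, 2)
```

Divisive clustering works in the opposite direction: it starts from one cluster containing everything and splits it repeatedly.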
OVERLAP CLUSTERING
ATTRIBUTE/GOAL CLUSTERING
TEXT CLUSTERING
• Stages for clustering
• Pre-processing
• Feature extraction- TF-IDF, word embeddings
• Distance and similarity measures:
• Euclidean distance
• Cosine similarity
• Jaccard similarity
• Manhattan distance
• Clustering algorithms
• Evaluating the clusters
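The four distance and similarity measures listed in the stages above can be sketched in plain Python. This assumes documents are already represented as equal-length numeric vectors (for Euclidean, Manhattan and cosine) or as sets of tokens (for Jaccard); the function names and sample values are illustrative.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens (non-empty inputs)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical term-count vectors: doc_b has the same word mix as doc_a, doubled.
doc_a = [1, 0, 2]
doc_b = [2, 0, 4]
same_direction = cosine_similarity(doc_a, doc_b)
overlap = jaccard(["i", "like", "books"], ["i", "read", "books"])
```

Cosine similarity ignores document length, which is why it is often preferred for text; Euclidean and Manhattan distance grow with the raw counts.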
Algorithms for clustering
[Figure: clustering algorithms]
TEXT CLUSTERING
• Application of clustering
TOPIC MODELLING
• In a technology- and knowledge-driven economy, collective
knowledge is digitized and stored, e.g.
• News
• Blogs
• Scientific articles
• Books
• Images
• Sound and videos
• Social media
• A computational tool is needed to organize, search and
understand this knowledge
Topic modeling
• Ideally, documents have multiple topics (Blei 2003)
• Finds the themes in documents
• Recognizes the words that belong to each topic present in a
corpus or document
• Topics represent a set of documents
• Scan the documents, detect words, cluster them and come up
with topics
• Divides a corpus of documents into
• Topics covered by the corpus
• Sets of documents grouped by topic
Topic modeling
• A digital repository needs a thematic structure
• For a large repository, no amount of manpower is enough to
build it by hand
• Machine learning approaches are available
• Probabilistic topic modelling algorithms
• They use probability and statistics
• They require no annotation or labelling of documents
Some models for Topic modeling
• Assumptions
• Topics have a statistical distribution in a document
• Extraction means checking how strong a topic is
• Topic modelling techniques
• Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA)
• Latent: topics are hidden in a document
• Dirichlet: a probability distribution
Latent Dirichlet Allocation (LDA).
• Topic: a distribution over a fixed vocabulary
• LDA is a generative probabilistic model (GPM)
• Data in a GPM are assumed to arise from a process with hidden variables
• The generative process defines a joint probability distribution
(JPD)
• Over both hidden and observed variables
• The JPD is used to calculate the conditional probability of the hidden
variables given the observed variables
• Observed variables are the words in the documents
• Hidden variables are the topics in the documents
• The generative process of LDA
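A minimal sketch of the generative story, under simplifying assumptions: the two topics and the four-word vocabulary are hypothetical and fixed by hand (in full LDA the topics are themselves drawn from a Dirichlet prior), and `alpha` is an assumed concentration parameter. The topic proportions `theta` and the per-word topic assignments are the hidden variables; only `words` would be observed.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    # 1. Draw the document's topic proportions (hidden).
    theta = sample_dirichlet([alpha] * len(topics), rng)
    assignments, words = [], []
    for _ in range(length):
        # 2. Draw a topic for this word position (hidden).
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Draw the word from that topic's distribution (observed).
        w = rng.choices(vocabulary, weights=topics[z])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Hypothetical fixed vocabulary and topics (each row sums to 1).
vocabulary = ["gene", "dna", "book", "read"]
topics = [
    [0.50, 0.40, 0.05, 0.05],  # a "genetics" topic
    [0.05, 0.05, 0.50, 0.40],  # a "reading" topic
]
rng = random.Random(0)
theta, assignments, words = generate_document(topics, vocabulary, 0.5, 8, rng)
```

Inference runs this story in reverse: given only the words, it estimates the conditional distribution of `theta` and the topic assignments.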
Latent Dirichlet Allocation (LDA).
• The topics on evolution and genetics have high probability
Source (Blei 2003)
Latent Dirichlet Allocation (LDA).
How LDA works
• Suppose we have a corpus
• Doc1: I want to farm next weekend.
• Doc2: As I was shopping last week, the Gor Mahia club
won against the Milele club.
• Doc3: I like reading very interesting books, especially about
basketball.
• Doc4: I am drinking juice as I read the grammatical
framework book on natural language processing.
• Doc5: My children are watching cartoons as they paint their
holiday work in preparation for school opening.
How LDA works.
• First preprocess the corpus: tokenize, clean and remove
stop words
• Then build a document-word matrix
Source: analytics
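A minimal sketch of building a document-word matrix, assuming a tiny hand-picked stop-word list and simple regex tokenization (both hypothetical choices); the three documents are abridged from the example corpus.

```python
import re

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"i", "to", "as", "the", "am", "on", "very"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def document_word_matrix(docs):
    """Return (vocabulary, matrix); matrix[d][w] counts word w in document d."""
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in docs]
    for row, tokens in zip(matrix, tokenized):
        for t in tokens:
            row[index[t]] += 1
    return vocabulary, matrix


docs = [
    "I want to farm next weekend.",
    "I like reading very interesting books.",
    "I am reading the grammatical framework book on natural language processing.",
]
vocabulary, matrix = document_word_matrix(docs)
```

Each row is one document and each column one vocabulary word. Note that "book" and "books" stay separate columns here because the sketch does no stemming or lemmatization.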
How LDA works.
• The document-word matrix is converted into two matrices:
a document-topic matrix and a topic-word matrix
Source: analytics
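One common way to estimate a document-topic matrix and a topic-word matrix is collapsed Gibbs sampling. The slides do not say which inference method is used, so the sketch below is an assumption; `alpha`, `beta`, the iteration count and the toy word ids are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=50,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # counts n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts n(k, w)
    topic_total = [0] * num_topics                              # counts n(k)
    assignments = []
    # Give every word occurrence a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample in proportion to P(topic | doc) * P(word | topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Normalise the smoothed counts into the two output matrices.
    doc_topic_matrix = [
        [(c + alpha) / (sum(row) + alpha * num_topics) for c in row]
        for row in doc_topic
    ]
    topic_word_matrix = [
        [(c + beta) / (topic_total[k] + beta * vocab_size) for c in row]
        for k, row in enumerate(topic_word)
    ]
    return doc_topic_matrix, topic_word_matrix


# Hypothetical toy vocabulary: 0=farm, 1=weekend, 2=book, 3=reading.
docs = [[0, 1, 0], [2, 3, 2], [0, 1, 1], [3, 2, 3]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

Each row of `doc_topic` gives one document's topic proportions, and each row of `topic_word` gives one topic's distribution over the vocabulary.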
Latent semantic analysis
• Based on the distributional hypothesis
• The meaning of a word is captured from the contexts in which it appears
• Words with similar meanings occur in similar contexts
• Specifically used for
• Concept searching
• Automated document categorization
• LSA computes the frequency of words in the documents (TF-IDF)
• LSA assumes
• Similar documents have similar distributions of word frequencies
• i.e. their syntactic and semantic information is similar
Latent semantic analysis
• The frequency calculation considers
• Words that are frequent in a document
• Words that are frequent across all the documents (corpus)
• High-frequency words represent the topics better than
low-frequency words
• A word's frequency in the corpus has higher precedence
than its frequency in a given document
• TF-IDF is calculated
• Captures polysemy (multiple meanings of a word)
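A minimal sketch of one common TF-IDF variant: term frequency within the document multiplied by the log of the inverse document frequency. The exact weighting used in the lecture is not stated, so this formula is an assumption, as are the toy documents; inputs are assumed to be non-empty token lists.

```python
import math


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = []
    for tokens in docs:
        row = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)  # frequency within this document
            idf = math.log(n_docs / df[term])      # rarity across the corpus
            row[term] = tf * idf
        weights.append(row)
    return weights


# Hypothetical pre-tokenized documents.
docs = [["reading", "book", "reading"], ["reading", "farm"], ["farm", "weekend"]]
weights = tf_idf(docs)
```

In this variant a term that occurs in every document gets weight zero, so the weights favour terms that are frequent in a document but not spread evenly across the corpus.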
Latent semantic analysis
• Using TF-IDF, a document-term matrix is created
• Rows for the documents in the corpus
• Columns for each term
• The document-term matrix is broken into
• A product of three matrices
• Using singular value decomposition (SVD)
• SVD is a matrix factorization technique
Latent semantic analysis
• Singular value decomposition
• A is the document-term matrix
• m is the number of documents in the corpus
• n is the number of unique words
• r is the number of topics
• U is the document-topic matrix
• Σ is the diagonal matrix of the singular values
• V is the word-topic matrix
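With a document-term matrix A of m documents by n unique words and r topics, the decomposition can be written as below. This assumes the usual truncated form, in which only the r largest singular values are kept, so the product approximates A rather than reproducing it exactly.

```latex
A_{m \times n} \;\approx\; U_{m \times r}\, \Sigma_{r \times r}\, V_{n \times r}^{\top}
```

The dimensions chain as (m × r)(r × r)(r × n) = m × n: each row of U places a document in topic space, and each row of V places a word in topic space.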
Example of the Topic modelling map
Example of topic modelling
• Repeat this experiment on topic modelling
• [Link]
gensim-python/
Application
• Medical industry
• Scientific research understanding
• Investigation reports
• Recommender System
• Blockchain
• Sentiment analysis
• Text summarisation
• Query expansion, which can be used in search engines
Conclusion
• Defined clustering
• Types of clustering
• Application of clustering
• Clustering application in topic modeling
• Looked at the models for topic modelling
• Looked at the applications of topic modelling
References
• Blei, David M. "Probabilistic topic models."
Communications of the ACM 55, no. 4 (2012): 77-84.
• [Link]
modelling-in-natural-language-processing/