Lecture 4

The document provides an overview of clustering and its application in topic modeling, emphasizing the importance of grouping similar documents without labeled datasets through unsupervised learning. It discusses various types of text clustering, stages involved, and algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). Additionally, it highlights real-life applications of topic modeling across different industries, including medical and scientific research.

Uploaded by

Munene Mutuma

CLUSTERING

LECTURE OBJECTIVES
• Understand clustering
• Apply clustering to topic modeling
• Demonstrate applications of topic modeling in real life
LECTURE OVERVIEW
• Definition of text clustering
• Types of text clustering
• Application of text clustering in topic modeling
• Applications of topic modeling
TEXT CLUSTERING

CLUSTERING
Groups similar documents/items together
Finds similarity and relationship patterns
Based on semantics and content
No labelled dataset needed
Unsupervised learning
Efficient way to create categories and analyze large amounts of data/information
LEVEL OF CLUSTERING
• Level of text clustering
• Document clustering

• Sentence clustering
• Word-level clustering: common theme or meaning
TYPES OF CLUSTERING
• Flat clustering

• Hierarchical clustering (agglomerative and divisive)
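The link between the two types can be sketched in a few lines: agglomerative clustering starts with every item in its own cluster and repeatedly merges the two closest clusters, and stopping at k clusters gives a flat partition. The sketch below is only an illustration; it assumes single linkage and Euclidean distance, and the function names and the toy points are hypothetical.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per item
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance = distance of the closest pair.
            d = min(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]

# Toy 2-D points (hypothetical); indices are returned, not coordinates.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.3, 4.9), (9.0, 0.0)]
flat = agglomerative(points, 3)
print(flat)
```

Divisive clustering works in the opposite direction (start with one cluster and split), which is not shown here.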


OVERLAP CLUSTERING
ATTRIBUTE/GOAL CLUSTERING
TEXT CLUSTERING
• Stages for clustering
• Pre-processing
• Feature extraction: TF-IDF, word embeddings
• Distance and similarity measures:
• Euclidean distance
• Cosine similarity
• Jaccard similarity
• Manhattan distance

• Clustering algorithms
• Evaluating the clusters
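The four distance and similarity measures listed in the stages can be written out directly. This is a minimal sketch using only the standard library; the toy count vectors and their three-word vocabulary are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard_similarity(tokens_a, tokens_b):
    # Works on sets of tokens rather than on numeric vectors.
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy term-count vectors over the vocabulary ["book", "read", "club"] (hypothetical).
doc_a = [2, 1, 0]
doc_b = [1, 1, 1]

print(euclidean(doc_a, doc_b))
print(manhattan(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_b))
print(jaccard_similarity(["book", "read"], ["book", "read", "club"]))
```

Euclidean and Manhattan are distances (smaller means more similar), while cosine and Jaccard are similarities (larger means more similar), so a clustering algorithm has to be told which convention it is using.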
Algorithms for clustering

[Figure: clustering algorithms]
TEXT CLUSTERING
• Application of clustering
TOPIC MODELLING
• In a technology- and knowledge-driven economy, collective knowledge is digitized and stored, e.g. as
• News
• Blogs
• Scientific articles
• Books
• Images
• Sound and video
• Social media
• We need computational tools to organize, search and understand this knowledge
Topic modeling
• Ideally, documents have multiple topics (Blei 2003)
• Find the themes in documents
• Recognize the words belonging to each topic present in a corpus or document
• Topics represent a set of documents
• Scan the documents, detect words, cluster them and come up with topics
• Divide a corpus of documents into
• Topics covered by the corpus
• Sets of documents grouped by topic
Topic modeling
• One thing we need is the thematic structure of the digital repository
• For a large repository, no amount of manpower is enough to annotate it by hand
• We have machine learning approaches
• Probabilistic topic modelling algorithms
• They use probabilistic and statistical methods
• They require no annotation or labelling of documents
Some models for Topic modeling
• Assumptions
• Topics have a statistical distribution within a document
• Extraction means estimating how strongly each topic is present
• Topic modelling techniques
• Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA)
• Latent: the topics are hidden in a document
• Dirichlet: a probability distribution (over probability vectors such as topic proportions)
Latent Dirichlet Allocation (LDA).
• Topic: a distribution over a fixed vocabulary
• LDA is a generative probabilistic model (GPM)
• Data in a GPM are treated as arising from a process that includes hidden variables
• The generative process defines a joint probability distribution (JPD)
• Over both hidden and observed variables
• The JPD is used to calculate the conditional probability of the hidden variables given the observed variables
• Observed variables are the words in the documents
• Hidden variables are the topics in the documents
• The generative process of LDA
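The generative process can be sketched as a small simulation: draw each topic as a distribution over the vocabulary, draw topic proportions for each document, then for every word position pick a topic and then a word from that topic. This is an illustrative sketch only; the vocabulary, the function names and the symmetric Dirichlet parameters `alpha` and `eta` are assumptions, and Dirichlet samples are obtained by normalizing Gamma draws.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, eta=0.5, seed=0):
    rng = random.Random(seed)
    # 1. Each topic is a distribution over the fixed vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    docs, assignments = [], []
    for _ in range(n_docs):
        # 2. Each document gets its own topic proportions.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words, zs = [], []
        for _ in range(doc_len):
            # 3. Pick a topic for this position, then a word from that topic.
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=topics[z])[0]
            zs.append(z)
            words.append(w)
        docs.append(words)
        assignments.append(zs)
    return topics, docs, assignments

# Hypothetical vocabulary for illustration.
vocab = ["gene", "dna", "evolve", "species", "data", "model"]
topics, docs, assignments = generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8)
for words in docs:
    print(" ".join(words))
```

In the simulation the topics and assignments are known; in practice only the words are observed, and inference runs this process in reverse to recover the hidden variables.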
Latent Dirichlet Allocation (LDA).
• The topics "evolution" and "genetics" have high probability

Source: (Blei 2003)


Latent Dirichlet Allocation (LDA).
How LDA works
• Let us take a corpus
• Doc1: I want to farm next weekend.
• Doc2: As I was shopping last week, the Gor Mahia club won against Milele club.
• Doc3: I like reading very interesting books, especially to do with basketball.
• Doc4: I am taking juice, as I am reading the grammatical framework book on natural language processing.
• Doc5: My children are watching cartoon as they paint their holiday work in preparation for school opening.
How LDA works.
• First preprocess the corpus: tokenize, clean and remove stop words
• Then build a document-word matrix

Source: analytics
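As a rough sketch of this step, the five example documents can be turned into a document-word count matrix with the standard library alone. The stop-word list below is a small hand-picked assumption, not a standard list, and no stemming is applied, so "book" and "books" stay separate columns.

```python
import re

corpus = [
    "I want to farm next weekend.",
    "As I was shopping last week, the Gor Mahia club won against Milele club.",
    "I like reading very interesting books, especially to do with basketball.",
    "I am taking juice, as I am reading the grammatical framework book on "
    "natural language processing.",
    "My children are watching cartoon as they paint their holiday work in "
    "preparation for school opening.",
]

# A tiny illustrative stop-word list (an assumption; real pipelines use a fuller one).
STOP_WORDS = {"i", "to", "as", "was", "the", "am", "on", "my", "are", "they",
              "their", "in", "for", "do", "with", "very", "of"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

tokenized = [tokenize(doc) for doc in corpus]
vocab = sorted({t for doc in tokenized for t in doc})
index = {term: j for j, term in enumerate(vocab)}

# Rows are documents, columns are vocabulary terms, cells are counts.
matrix = [[0] * len(vocab) for _ in corpus]
for i, doc in enumerate(tokenized):
    for t in doc:
        matrix[i][index[t]] += 1

print(vocab)
for row in matrix:
    print(row)
```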
How LDA works.
• The document-word matrix is converted to two matrices: a Document-Topic matrix and a Topic-Word matrix

Source: analytics
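One common way to estimate these two matrices is collapsed Gibbs sampling; the slides do not say which inference method is intended, so treat the following as one possible sketch. The function name, the hyperparameters `alpha` and `eta`, the iteration count and the toy documents are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, eta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    n_vocab = len(vocab)
    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Random initial topic assignment for every word occurrence.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta) / (topic_total[t] + n_vocab * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the new assignment.
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy pre-tokenized documents (hypothetical).
docs = [
    ["farm", "weekend", "farm"],
    ["club", "won", "club"],
    ["reading", "book", "basketball"],
    ["reading", "book", "language"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
print(doc_topic)   # document-topic counts
print(topic_word)  # topic-word counts
```

The returned tables hold counts; dividing each row by its total (after adding `alpha` or `eta`) gives the document-topic and topic-word probabilities.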
Latent semantic analysis
• Based on the distributional hypothesis
• The meaning of a word is captured from the contexts in which it appears
• Words with similar meanings occur in similar contexts
• Specifically used for
• Concept searching
• Automated document categorization
• LSA computes the frequency of words in the documents (TF-IDF)
• LSA assumes
• Similar documents have similar distributions of word frequencies
• i.e. their syntactic and semantic information is similar
Latent semantic analysis
• The frequency calculation considers
• Frequent words in a document
• Frequent words across all the documents (the corpus)
• High-frequency words represent the topics better than low-frequency words
• The frequency of a word in the corpus has higher precedence than its frequency in a given document
• TF-IDF is calculated
• Only partially captures polysemy (multiple meanings of a word)
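A minimal TF-IDF sketch follows. TF-IDF has several variants; this one assumes term frequency = count / document length and inverse document frequency = log(N / document frequency), and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            # Frequency within the document, scaled down for terms
            # that appear in many documents of the corpus.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy pre-tokenized documents (hypothetical).
docs = [
    ["reading", "books", "basketball"],
    ["reading", "book", "language", "processing"],
    ["club", "won", "club"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

With this variant a term that occurs in every document gets weight zero, and "reading", which occurs in two of the three documents, scores lower than "books", which occurs in only one.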
Latent semantic analysis
• Using TF-IDF, a document-term matrix is created
• Rows for the documents in the corpus
• Columns for each term
• The document-term matrix is broken into
• A product of three matrices
• Using singular value decomposition (SVD)
• SVD is a matrix factorization technique
Latent semantic analysis
• Singular value decomposition: $A = U \Sigma V^{T}$
• A is the m × n document-term matrix
• U is the m × r document-topic matrix
• Σ is the r × r diagonal matrix of singular values
• V is the n × r word-topic matrix
• m is the number of documents in the corpus
• n is the number of unique words
• r is the number of topics
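To make the factorization concrete, the top r singular values and vectors of a small document-term matrix can be approximated with power iteration and deflation. This is a teaching sketch in pure Python, not how libraries compute the SVD; the function names and the toy matrix are assumptions, and the fixed start vector would fail in the rare case where it is orthogonal to the leading singular vector.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def truncated_svd(A, r, n_iter=500):
    """Top-r singular triplets of A by power iteration with deflation.

    Returns (U, S, V): U[k] and V[k] are the k-th left (document) and
    right (word) singular vectors, S[k] the k-th singular value.
    """
    A = [list(map(float, row)) for row in A]
    U, S, V = [], [], []
    for _ in range(r):
        At = transpose(A)
        v = [1.0 / math.sqrt(len(At))] * len(At)  # deterministic start vector
        for _ in range(n_iter):
            w = matvec(At, matvec(A, v))  # one power-iteration step on A^T A
            n = norm(w)
            if n == 0.0:
                break
            v = [x / n for x in w]
        Av = matvec(A, v)
        sigma = norm(Av)
        if sigma < 1e-12:
            break  # the remaining matrix is numerically zero
        u = [x / sigma for x in Av]
        U.append(u)
        S.append(sigma)
        V.append(v)
        # Deflate: subtract the rank-1 component just found.
        A = [[A[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return U, S, V

# Toy document-term matrix: 3 documents (rows) x 4 terms (columns), hypothetical.
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 3],
]
U, S, V = truncated_svd(A, r=2)
print(S)
```

In this toy matrix the first two documents share terms and the third uses different ones, so one component loads only on document 3 and the other on documents 1 and 2, which is the sense in which each component acts as a "topic".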
Example of the Topic modelling map
Example of topic modelling
• Repeat this experiment on topic modelling
• [Link]
gensim-python/
Application
• Medical industry
• Scientific research understanding
• Investigation reports
• Recommender System
• Blockchain
• Sentiment analysis
• Text summarisation
• Query expansion, which can be used in search engines
Conclusion
• Defined clustering
• Covered the types of clustering
• Covered applications of clustering
• Applied clustering to topic modeling
• Looked at models for topic modelling
• Looked at applications of topic modelling
References
• Blei, David M. "Probabilistic topic models."
Communications of the ACM 55, no. 4 (2012): 77-84.
• [Link]
modelling-in-natural-language-processing/