Lecture 4

The document provides an overview of clustering and its application in topic modeling, emphasizing the importance of grouping similar documents without labeled datasets through unsupervised learning. It discusses various types of text clustering, stages involved, and algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). Additionally, it highlights real-life applications of topic modeling across different industries, including medical and scientific research.

Uploaded by

Munene Mutuma

CLUSTERING

LECTURE OBJECTIVES
• Understand clustering
• Apply clustering to topic modeling
• Demonstrate applications of topic modeling in real life
LECTURE OVERVIEW
• Definition of text clustering
• Types of text clustering
• Application of text clustering in topic modeling
• Applications of topic modeling
TEXT CLUSTERING

CLUSTERING
Groups similar documents/items together
Finds similarity and relationship patterns
Based on semantics and content
Needs no labelled dataset
A form of unsupervised learning
An efficient way to create categories and analyze large amounts of data/information
LEVELS OF CLUSTERING
• Levels of text clustering
  • Document clustering
  • Sentence clustering
  • Word-level clustering: words that share a common theme or meaning
TYPES OF CLUSTERING
• Flat clustering

• Hierarchical clustering (agglomerative and divisive)
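
Agglomerative (bottom-up) hierarchical clustering can be sketched in a few lines. The example below is a minimal illustration, not a production implementation: it assumes single linkage (distance between the closest members of two clusters) and Euclidean distance, and the sample points and the function name `agglomerative` are made up for the example.

```python
import math

def single_linkage(c1, c2, points):
    # Distance between two clusters = distance between their closest members.
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, n_clusters):
    # Bottom-up: start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters that are closest to each other.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair (b > a, so deleting b leaves a in place).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical 2-D points standing in for document vectors.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerative(points, 3)
print(clusters)
```

Divisive clustering works in the opposite direction: it starts with one cluster holding everything and splits it repeatedly.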

OVERLAP CLUSTERING
ATTRIBUTE/GOAL CLUSTERING
TEXT CLUSTERING
• Stages for clustering
• Pre-processing
• Feature extraction- TF-IDF, word embeddings
• Distance/similarity measures:
  • Euclidean distance
  • Cosine similarity
  • Jaccard similarity
  • Manhattan distance

• Clustering algorithms
• Evaluating the clusters
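
The four distance/similarity measures listed above can be sketched in plain Python. This is a minimal illustration: the two count vectors and token lists are invented for the example, and the handling of zero vectors and empty sets is an assumption rather than a standard.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; ignores vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def jaccard_similarity(tokens_a, tokens_b):
    # Shared tokens divided by all distinct tokens.
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty documents count as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical term-count vectors for two documents.
doc_a = [1, 0, 2, 0]
doc_b = [1, 1, 0, 0]
print(euclidean_distance(doc_a, doc_b))
print(manhattan_distance(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_b))
print(jaccard_similarity(["reading", "book", "juice"],
                         ["reading", "book", "basketball"]))
```

Cosine similarity is often preferred for text because it does not depend on document length, whereas Euclidean and Manhattan distances do.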
Algorithms for clustering

[Figure: overview of clustering algorithms; image not recovered]
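
As one concrete example of a flat clustering algorithm, k-means can be sketched as follows. The slide's figure is not available, so choosing k-means here is an assumption; the deterministic initialisation (first k points as centroids) and the sample points are also only for illustration.

```python
import math

def kmeans(points, k, n_iter=10):
    # Assumption: use the first k points as initial centroids so the example
    # is deterministic; real implementations usually initialise randomly.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D points standing in for document vectors.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

In practice the points would be TF-IDF or embedding vectors produced in the feature extraction stage.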
TEXT CLUSTERING
• Application of clustering
TOPIC MODELLING
• In a technology- and knowledge-driven economy, collective knowledge is digitized and stored, e.g.:
  • News
  • Blogs
  • Scientific articles
  • Books
  • Images
  • Sound and videos
  • Social media
• We need computational tools to organize, search, and understand this knowledge
Topic modeling
• Ideally, documents have multiple topics (Blei 2003)
• Find the themes in documents
• Recognize the words that belong to each topic present in a corpus or document
• Topics represent a set of documents
• Scan the documents, detect words, cluster them, and come up with topics
• Divide a corpus of documents into:
  • The topics covered by the corpus
  • Sets of documents grouped by topic
Topic modeling
• One thing we need is the thematic structure of the digital repository
• For a large repository, no amount of manpower is enough
• Instead, we have machine learning approaches
• Probabilistic topic modelling algorithms
  • Use probability and statistics
  • Require no annotation or labelling of documents
Some models for Topic modeling
• Topic models assume:
  • Topics have a statistical distribution within a document
  • Extraction means checking how strong a topic is in a document
• Topic modelling techniques
  • Latent Dirichlet Allocation (LDA)
  • Latent Semantic Analysis (LSA)
• Latent: the topics are hidden in a document
• Dirichlet: a probability distribution
Latent Dirichlet Allocation (LDA).
• Topic: a distribution over a fixed vocabulary
• LDA is a generative probabilistic model (GPM)
• Data in a GPM are assumed to arise from a process that includes hidden variables
• The generative process defines a joint probability distribution (JPD) over both hidden and observed variables
• The JPD is used to calculate the conditional probability of the hidden variables given the observed variables
• Observed variables are the words in the documents
• Hidden variables are the topics in the documents
• The generative process of LDA
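
The generative process can be sketched as a small simulation. This is a simplified illustration under stated assumptions: the two topics and their word probabilities are invented and held fixed (in full LDA the topics are themselves drawn from a Dirichlet), and the names `sample_dirichlet` and `generate_document` are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw can be obtained by normalising independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    # Step 1: draw the document's topic proportions (hidden variable).
    theta = sample_dirichlet(alpha, rng)
    assignments, words = [], []
    for _ in range(n_words):
        # Step 2: draw a topic for this word position (hidden variable).
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 3: draw the word from that topic's distribution (observed).
        vocab = list(topics[z])
        probs = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
        assignments.append(z)
    return theta, assignments, words

# Hypothetical topics: each is a distribution over a fixed vocabulary.
topics = [
    {"gene": 0.5, "dna": 0.3, "evolution": 0.2},
    {"data": 0.5, "computer": 0.3, "number": 0.2},
]
rng = random.Random(0)
theta, assignments, words = generate_document(
    topics, alpha=[1.0, 1.0], n_words=8, rng=rng
)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: only `words` is observed, and inference estimates the hidden `theta` and topic assignments that most plausibly produced them.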
Latent Dirichlet Allocation (LDA).
Latent Dirichlet Allocation (LDA).
• In the illustrated example, the topics on evolution and genetics have high probability

Source (Blei 2003)

Latent Dirichlet Allocation (LDA).
How LDA works
• Let us take a corpus:
• Doc1: I want to farm next weekend.
• Doc2: As I was shopping last week, the Gor maihia club won against milele club.
• Doc3: I like reading very interesting books especially to do basketball.
• Doc4: I am taking juice, as I am reading the grammatical framework book on natural language processing.
• Doc5: My children are watching cartoon as they paint their holiday work in preparation of school opening.
How LDA works.
• After preprocessing (tokenization, cleaning, and stop-word removal), build a document-word matrix

Source: analytics
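
Building the document-word matrix can be sketched as follows, using three of the example documents. The stop-word list is a tiny hand-picked assumption for illustration, and `preprocess` and `document_word_matrix` are hypothetical helper names; a real pipeline would typically use a library tokenizer and stop-word list.

```python
import re
from collections import Counter

# Assumption: a small hand-picked stop-word list, for illustration only.
STOP_WORDS = {"i", "to", "as", "was", "the", "am", "on", "my", "are",
              "they", "their", "in", "of", "very", "do", "a", "an", "and"}

def preprocess(text):
    # Lowercase, keep alphabetic tokens only, then drop stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_word_matrix(docs):
    tokenized = [preprocess(d) for d in docs]
    # The vocabulary is every distinct remaining token, in sorted order.
    vocab = sorted({t for doc in tokenized for t in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

docs = [
    "I want to farm next weekend.",
    "I like reading very interesting books especially to do basketball.",
    "I am taking juice, as I am reading the grammatical framework book "
    "on natural language processing.",
]
vocab, matrix = document_word_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each cell holds how many times a vocabulary word occurs in a document, which is the input the topic model works from.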
How LDA works.
• The document-word matrix is converted into two matrices: a document-topic matrix and a topic-word matrix

Source: analytics
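
One way to read this conversion is as a low-dimensional factorization. The symbols below are assumptions introduced for illustration: $D$ documents, $V$ vocabulary words, $K$ topics, $\Theta$ for the document-topic matrix and $\Phi$ for the topic-word matrix.

```latex
% Probability of word w in document d, mixing over the K topics:
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)

% The same statement in matrix form, with dimensions:
\underbrace{P}_{D \times V} \;\approx\;
\underbrace{\Theta}_{D \times K}\;
\underbrace{\Phi}_{K \times V}
```

Each row of $\Theta$ and each row of $\Phi$ sums to 1, since they are probability distributions over topics and over words respectively.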
Latent semantic analysis
• Based on the distributional hypothesis
  • The meaning of a word is captured from the contexts in which it appears
  • Words with similar meanings occur in similar contexts
• Used specifically for
  • Concept searching
  • Automated document categorization
• LSA computes the frequency of words in a document (TF-IDF)
• LSA assumes
  • Similar documents have similar distributions of word frequencies
  • i.e., their syntactic and semantic information is similar
Latent semantic analysis
• The frequency calculation considers
  • How frequent a word is in a given document
  • How frequent a word is across all the documents (the corpus)
• More frequent words represent the topics better than less frequent ones
• A word's frequency in the corpus takes precedence over its frequency in a given document
• TF-IDF is calculated
• Captures polysemy (multiple meanings of a word)
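
A TF-IDF calculation can be sketched as below. This uses one common variant, stated here as an assumption: term frequency is the count divided by document length, and inverse document frequency is log(N / df), with no smoothing. The tokenized documents are invented for the example. Under this variant, a term scores high when it is frequent in a document and appears in few other documents.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # tf = count / document length; idf = log(N / df), unsmoothed.
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical pre-processed documents.
docs = [
    ["reading", "books", "basketball"],
    ["reading", "book", "language", "processing"],
    ["children", "watching", "cartoon"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Here "reading" appears in two of the three documents, so it receives a lower weight in the first document than "basketball", which appears in only one.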
Latent semantic analysis
• Using TF-IDF, a document-term matrix is created
  • Rows for the documents in the corpus
  • Columns for each term
• The document-term matrix is broken into a product of three matrices
  • Using singular value decomposition (SVD)
  • SVD is a matrix factorization technique
Latent semantic analysis
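• Singular value decomposition

$A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, (V^{T})_{r \times n}$

• A is the document-term matrix
• U is the document-topic matrix
• Σ is the diagonal matrix of singular values
• V is the word-topic matrix
• m is the number of documents in the corpus
• n is the number of unique words
• r is the number of topics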
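
The decomposition can be tried out with NumPy's `numpy.linalg.svd`. The sketch below is illustrative only: the 4-document by 5-term matrix is invented, and keeping r = 2 topics is an arbitrary choice for the example.

```python
import numpy as np

# Hypothetical document-term matrix A: 4 documents (rows) x 5 terms (columns).
A = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
])

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r largest singular values, i.e. r topics.
r = 2
U_r = U[:, :r]        # document-topic matrix, shape (4, r)
S_r = np.diag(s[:r])  # diagonal matrix of singular values, shape (r, r)
Vt_r = Vt[:r, :]      # transposed word-topic matrix, shape (r, 5)

# Rank-r approximation of the original document-term matrix.
A_r = U_r @ S_r @ Vt_r
print(U_r.shape, S_r.shape, Vt_r.shape)
print(np.round(A_r, 2))
```

Rows of `U_r` place each document in the r-dimensional topic space, so documents can then be compared there, for example with cosine similarity.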
Example of the Topic modelling map
Example of topic modelling
• Repeat this experiment on topic modelling
• [Link]
gensim-python/
Application
• Medical industry
• Scientific research understanding
• Investigation reports
• Recommender System
• Blockchain
• Sentiment analysis
• Text summarisation
• Query expansion, which can be used in search engines
Conclusion
• Defined clustering
• Covered the types of clustering
• Covered the application of clustering
• Covered the application of clustering in topic modeling
• Looked at the models for topic modeling
• Looked at the applications of topic modelling
References
• Blei, David M. "Probabilistic topic models."
Communications of the ACM 55, no. 4 (2012): 77-84.
• [Link]
modelling-in-natural-language-processing/
