
Topic Modeling: LSA, PLSA, LDA, HDP & lda2vec

Theory, Mathematics, and Implementation

Dr. Ensaf H. Mohamed


Associate Prof. of Computer Science
Nile University

Fall 2025

Dr. Ensaf H. Mohamed - Nile University NLP: Machine Translation 1 / 48


Overview

1 Introduction
2 Latent Semantic Analysis (LSA)
Steps in LSA
3 Latent Semantic Analysis (LSA)
4 Probabilistic Latent Semantic Analysis (PLSA)
5 Latent Dirichlet Allocation (LDA)
6 Hierarchical Dirichlet Process (HDP)
7 lda2vec
8 Deep Learning: lda2vec
9 Conclusion



NLU Hierarchy

In Natural Language Understanding (NLU), we extract meaning through a hierarchy of lenses:
Words: Morphology, semantics.
Sentences: Syntax, relations.
Paragraphs: Discourse structure.
Documents: Topic Modeling.

At the document level, the most useful way to understand text is by analyzing its topics.



What is Topic Modeling?
A sophisticated NLP technique that identifies underlying
topics in text collections.
It is an unsupervised learning technique.
Core Idea:
Each document consists of a mixture of topics.
Each topic consists of a collection of words.
Goal: Uncover latent variables (topics) that shape the
meaning of the document and corpus.

Figure: Visualization of Topic Interconnections


Applications of Topic Modeling

Topic modeling provides valuable insights for various text-based applications:
1 Document Classification: Automatically categorizing
documents by their dominant topics.
2 Information Retrieval: Improving search accuracy by
understanding the context beyond simple keyword matching.
3 Text Summarization: Extracting key thematic points
efficiently.
4 Recommendation Systems: Personalizing content based on
user interest in specific topics.



The Topic Modeling Landscape

We will explore the evolution of topic modeling techniques:


1 LSA (Latent Semantic Analysis): The linear algebra
approach.
2 PLSA (Probabilistic LSA): The probabilistic foundation.
3 LDA (Latent Dirichlet Allocation): The Bayesian standard.
4 HDP (Hierarchical Dirichlet Process): The nonparametric
extension.
5 lda2vec: The deep learning hybrid.



LSA: Concept & Intuition

Core Idea: Words that appear in similar contexts have similar meanings.

LSA uses Singular Value Decomposition (SVD) to reduce dimensionality.
It identifies "hidden" concepts (topics) by compressing the Term-Document matrix.
In our Example:
We have documents about "Machine Learning" and "Data Science".
Even if documents don't share many keywords, LSA sees they share context (e.g., "subset", "field").
SVD separates these into orthogonal dimensions (Topic 0 vs Topic 1).



7.1.2 Steps Involved in LSA
1 Create a Term-Document Matrix:
Rows = terms (words)
Columns = documents
Entries = weight of the term in the document (raw count or TF-IDF).
2 Apply Singular Value Decomposition (SVD):
Decomposes the matrix into three smaller matrices: $U$, $\Sigma$, and $V^T$.
$U$: Term-concept associations.
$\Sigma$: Diagonal matrix of singular values (importance of each concept).
$V^T$: Document-concept associations.
3 Reduce Dimensionality:
Retain only the top k singular values and their corresponding vectors.
Filters out noise and retains significant patterns.
4 Interpret Topics:
Analyze the reduced matrices to identify underlying topics.

Constructing the Document-Term Matrix

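Given m documents and n words, we construct an m × n matrix A.
Here rows are documents and columns are words, i.e., the transpose of the term-document layout used in the LSA steps; either orientation works if it is used consistently.
Raw Counts: Simple frequency of word j in document i.
Problem with Raw Counts: Common words ("the", "and") dominate rare, meaningful words.
Solution: Use TF-IDF (Term Frequency-Inverse Document Frequency).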



TF-IDF Weighting

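TF (Term Frequency): How often a word appears in a document.
IDF (Inverse Document Frequency): How rare the word is across the entire corpus.

$$w_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_j}$$

Here $\mathrm{tf}_{i,j}$ is the count of word j in document i, $N$ is the number of documents, and $\mathrm{df}_j$ is the number of documents that contain word j.

High weight = frequent in this document, but rare elsewhere (e.g., "gentrification").
Low weight = frequent everywhere (e.g., "build").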
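
The weighting formula can be sketched in a few lines of plain Python. This is a minimal illustration, not the course notebook's code: the function name `tfidf_matrix` and the three toy sentences are assumptions made for the example, and the unsmoothed formula above is applied literally.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, weights) with weights[i][j] = tf_ij * log(N / df_j)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # df_j: number of documents that contain word j at least once
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        weights.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, weights

# Hypothetical toy corpus (not from the slides)
docs = [
    "the city plans to build housing",
    "gentrification changes the city",
    "the team will build the bridge",
]
vocab, W = tfidf_matrix(docs)
```

A word that occurs in every document (here "the") gets weight 0 because log(N/N) = 0, while a word confined to one document gets the largest IDF. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this literal version.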



Singular Value Decomposition (SVD)

LSA uses Truncated SVD to reduce dimensionality.

$$A \approx U_t S_t V_t^T$$

$A$: Original Document-Term Matrix.
$U_t$: Document-Topic Matrix (Document vectors).
$S_t$: Diagonal matrix of singular values (Topic strength).
$V_t^T$: Topic-Term Matrix (Term vectors).
$t$: Hyperparameter (number of topics).
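
To make the truncation explicit, the decomposition can be written as a sum of rank-one terms. The symbols $\sigma_i$, $u_i$, $v_i$, and $r$ (the rank of $A$) are introduced here for illustration and do not appear on the slide.

```latex
A = \sum_{i=1}^{r} \sigma_i \, u_i v_i^T,
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_t = U_t S_t V_t^T = \sum_{i=1}^{t} \sigma_i \, u_i v_i^T

\lVert A - A_t \rVert_F = \sqrt{\sum_{i=t+1}^{r} \sigma_i^2}
```

By the Eckart-Young theorem, $A_t$ is the best rank-$t$ approximation of $A$ in the Frobenius norm, so the discarded singular values measure exactly how much of the matrix is lost.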



Visualizing SVD

We keep only the top t singular values.
This filters out "noise" and retains the "signal" (latent topics).

Figure: Dimensionality Reduction via SVD



Interpreting LSA Matrices

After decomposition, we can perform vector arithmetic:
Document Similarity: Compare rows in $U_t$ using Cosine Similarity.
Word Similarity: Compare rows in $V_t$.
Query Matching: Map a new query into the t-dimensional space and find the closest documents.
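
A minimal sketch of the similarity and query-matching steps follows. The document vectors are hypothetical rows of U_t with t = 2, invented for illustration, and the query is assumed to have already been mapped into the same 2-dimensional space.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def closest_documents(query_vec, doc_vectors):
    """Return document names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vectors.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical rows of U_t (illustrative values, not real LSA output)
doc_vectors = {
    "doc_ml_1": [0.9, 0.1],
    "doc_ml_2": [0.8, 0.2],
    "doc_ds_1": [0.1, 0.9],
}
ranking = closest_documents([1.0, 0.0], doc_vectors)
```

Cosine similarity ignores vector length, so a long and a short document on the same topic still score as close.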



LSA Implementation: Setup

Using the Data Science and Machine Learning corpus (Chapter 7, Ex 1).

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "Data science is an interdisciplinary field.",
        "Machine learning is a subset of data science.",
        "Artificial intelligence is a broader concept than machine learning.",
        "Deep learning is a subset of machine learning."
    ]

    # Build the TF-IDF document-term matrix
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    # Keep the top 2 latent concepts (topics)
    lsa = TruncatedSVD(n_components=2, random_state=42)
    lsa.fit(X)



LSA Results: Interpreting the Output

The Logic: The algorithm successfully separated the corpus into two distinct concepts based on term co-occurrence.

Topic 0: Machine Learning
machine (0.51)
learning (0.51)
Captures the "technical" aspect.

Topic 1: Data Science
data (0.55)
science (0.55)
interdisciplinary (0.36)
Captures the "field" aspect.

Note: LSA can sometimes produce negative weights, which makes interpretation harder compared to LDA.



Real Example: Academic Field Sorting
Scenario: You have thousands of unlabelled research paper
abstracts.
LSA in Action

Input Document: "Deep learning is a subset of machine learning."

Without LSA: The computer only sees specific words ("Deep", "Learning"). It doesn't know this belongs to "Data Science".
With LSA:
The model sees that "learning" often appears near "data" and "science" in other documents.
It projects this document onto the Hidden Topic 1 axis.
Result: It automatically tags this paper as related to "Data Science" even if the word "Data" never appears in the title.



LSA: Pros and Cons

Advantages
Efficient: Fast dimensionality reduction.
Synonymy: Captures that "car" and "auto" are similar.
Noise Reduction: Removes irrelevant variations in data.

Limitations
Linearity: Assumes linear relationships between terms.
Interpretability: Topics can contain negative values, which are hard to interpret physically.
Polysemy: Struggles with words that have multiple meanings (e.g., "bank").



Probabilistic Latent Semantic Analysis (PLSA)

PLSA shifts from Linear Algebra to Probability Theory.
Goal: Find a probabilistic model P(D, W) that generates the observed data.
Instead of minimizing the Frobenius-norm reconstruction error (as LSA's truncated SVD does), it maximizes the likelihood of the observed data.



PLSA Generative Process

How is a document generated in PLSA?
1 Select a document d with probability P(d).
2 Choose a latent topic z with probability P(z|d).
3 Generate a word w with probability P(w|z).

Joint Probability:

$$P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z)$$



PLSA Graphical Model

Visualizing the dependencies:
d (observed) → z (latent) → w (observed).
z acts as the bottleneck variable explaining the connection between documents and words.



PLSA: Alternative Parameterization
Equivalent parameterization starting from topics:

$$P(D, W) = \sum_{Z} P(Z)\, P(D \mid Z)\, P(W \mid Z)$$

Comparison to LSA:
P(Z) ≈ Singular values matrix S (Topic Probability).
P(D|Z) ≈ Document-Topic matrix U.
P(W|Z) ≈ Topic-Term matrix V.
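
The equivalence of the two parameterizations follows from applying Bayes' rule to the document-topic pair. The short derivation below uses lowercase d, z, w for single outcomes, as in the generative process:

```latex
P(d)\,P(z \mid d) = P(d, z) = P(z)\,P(d \mid z)

P(d, w) = \sum_{z} P(d)\,P(z \mid d)\,P(w \mid z)
        = \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z)
```

Both forms rely on the same assumption: given the topic z, the word w is independent of the document d.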



Training PLSA: The EM Algorithm

P(z|d) and P(w|z) are multinomial distributions.
Because the topic z is latent (hidden), these distributions cannot be estimated by simple counting.
Expectation-Maximization (EM) Algorithm:
E-Step: Compute the posterior probability of the latent topics, P(z|d, w), given the current parameters.
M-Step: Update the parameters to maximize the expected log-likelihood under those posteriors.
Repeat until convergence.
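
The EM loop can be sketched in plain Python as follows. This is an illustrative, unoptimized version under stated assumptions: `plsa_em` and the toy count matrix are hypothetical names and data, every document is assumed to contain at least one word, and a tiny constant is added to avoid division by zero.

```python
import random

def normalize(row):
    """Scale non-negative values so they sum to 1."""
    total = sum(row)
    return [x / total for x in row]

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Fit P(z|d) and P(w|z) to a document-word count matrix with EM."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random initialization of both multinomial families
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iters):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z)
                post = normalize([p_z_d[d][z] * p_w_z[z][w] + 1e-12
                                  for z in range(n_topics)])
                for z in range(n_topics):
                    expected = counts[d][w] * post[z]
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalize the expected counts into probabilities
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize([x + 1e-12 for x in row]) for row in new_w_z]
    return p_z_d, p_w_z

# Hypothetical counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]
p_z_d, p_w_z = plsa_em(counts, n_topics=2)
```

Note that `p_z_d` holds one row per training document, which is the source of PLSA's parameter growth and of its difficulty with unseen documents.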



Limitations of PLSA

Despite being more flexible than LSA, PLSA has major flaws:
1 No generative model for P(D): We don’t know how to assign
probabilities to new, unseen documents.
2 Overfitting: The number of parameters grows linearly with
the number of documents.

Solution: Latent Dirichlet Allocation (LDA).



LDA: Concept & Intuition

Core Idea: Generative Probabilistic Model.

Assumption 1: Every document is a mixture of topics (e.g., 80% NLP, 20% Robotics).
Assumption 2: Every topic is a mixture of words (e.g., NLP topic has high probability for "language", "text").
In our Example:
We feed it mixed documents (NLP, Vision, Robotics, Quantum).
LDA "reverses" the process to find which words likely belong together.
It uses Dirichlet Priors to encourage sparsity (topics should be distinct).



The Dirichlet Distribution

A "Distribution over Distributions".
It controls the sparsity of topics.
We want documents to contain only a few topics, not all of them.
We want topics to contain only a few words, not the whole dictionary.
A Dirichlet prior with a small concentration parameter (below 1) encourages this sparsity.
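
The effect of the concentration parameter can be checked by sampling. The sketch below draws from a symmetric Dirichlet by normalizing Gamma draws (a standard construction) and compares how dominant the largest component is on average; the helper names and the choice of 5 topics are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_share(alpha, k=5, n_samples=500, seed=0):
    """Average size of the biggest component across many samples."""
    rng = random.Random(seed)
    samples = (sample_dirichlet(alpha, k, rng) for _ in range(n_samples))
    return sum(max(sample) for sample in samples) / n_samples

sparse = mean_largest_share(alpha=0.1)   # mass concentrates on few topics
dense = mean_largest_share(alpha=10.0)   # mass spreads almost evenly
```

With a small alpha, one topic typically takes most of the probability mass in each sample; with a large alpha, the samples stay close to the uniform distribution.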



LDA Generative Process (Step-by-Step)

To generate a document in LDA:
1 Choose a topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$.
2 For each word n in the document:
1 Choose a topic $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$.
2 Choose a word $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$, where each topic's word distribution $\beta_k$ comes from $\mathrm{Dir}(\eta)$.

The goal of training is to reverse this process to find θ and β.
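
The generative story can be simulated directly. The following is a sketch under assumptions: the four-word vocabulary, the values of alpha and eta, and all function names are invented for illustration, and only the forward (generation) direction is shown, not inference.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Symmetric Dirichlet(alpha) sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(n_words, alpha, beta, vocab, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, len(beta), rng)   # theta_d ~ Dir(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)            # z_dn ~ Multinomial(theta_d)
        w = sample_categorical(beta[z], rng)          # w_dn ~ Multinomial(beta_z)
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

rng = random.Random(42)
vocab = ["language", "text", "robot", "sensor"]      # hypothetical vocabulary
n_topics, eta = 2, 0.1
# Each topic's word distribution beta_k ~ Dir(eta)
beta = [sample_dirichlet(eta, len(vocab), rng) for _ in range(n_topics)]
theta, topics, words = generate_document(8, alpha=0.5, beta=beta,
                                         vocab=vocab, rng=rng)
```

Training observes only `words` and has to recover plausible values of `theta` and `beta`, which is why inference methods such as variational Bayes or Gibbs sampling are needed.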



LDA Plate Notation
α: Prior for document-topic distribution.
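η: Prior for topic-word distribution (written β in some texts; here β denotes the topic-word distributions themselves, as in the generative process).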
θ: Topic distribution for document.
z: Topic assignment for a specific word.
w: The observed word.

Figure: Graphical Model of LDA



LDA Implementation: Corpus Setup

Using the NLP, Computer Vision, and Robotics corpus (Chapter 7, Ex 2).

    from gensim import corpora

    corpus = [
        "Natural language processing enables computers to understand human language.",
        "Computer vision allows machines to interpret... visual data.",
        "Robotics combines engineering and computer science...",
        "Quantum computing leverages quantum mechanics..."
    ]

    # Gensim requires Bag-of-Words (BoW) format
    texts = [[word for word in doc.lower().split()] for doc in corpus]
    # Map each unique token to an integer id
    dictionary = corpora.Dictionary(texts)
    corpus_bow = [dictionary.doc2bow(text) for text in texts]



LDA Results: Interpreting the Output

The Logic: LDA identifies probability distributions. Words with high probability in the same topic define that topic.

Topic 0: NLP Focus
language (0.067)
natural (0.067)
processing (0.067)
Interpretation: Clearly about Linguistics/NLP.

Topic 1: CS/Engineering
machines (0.070)
computer (0.070)
engineering (0.070)
Interpretation: General Hardware/CS.



Real Example: Categorizing Tech News
Scenario: A news aggregator needs to label an article about "AI Robots".
LDA in Action

Input Document: "Robotics combines engineering and computer science to create intelligent machines."

LDA Analysis:
The word "engineering" has a high probability in Topic 1 (Hardware).
The words "computer" and "science" have probabilities in both topics, but lean toward Topic 1 here.
Output Distribution:
The model assigns this document: 90% Topic 1 (Engineering) and 10% Topic 0 (NLP).
Unlike LSA, this gives us a specific percentage breakdown, allowing for "fuzzy" categorization.



LDA: Pros and Cons

Advantages
Generalization: Can assign topics to unseen documents.
Sparsity: Dirichlet priors create cleaner, more focused topics.
Interpretable: Results are usually human-readable.

Limitations
Hyperparameters: Must choose K (number of topics) in
advance.
Bag-of-Words: Ignores word order and syntax.
Computational Cost: Slower than LSA for very large datasets.



HDP: Concept & Intuition

Core Idea: What if we don't know the number of topics?

LDA requires us to set K (number of topics) manually.
HDP is Non-parametric: It assumes an infinite number of potential topics exist.
Chinese Restaurant Process:
New data points (words) can sit at an existing table (topic) or start a new one.
In our Example: We did not tell HDP there were 2 topics. It analyzed the "Climate" and "Conservation" texts and decided 2 topics were sufficient to explain the data.



The Chinese Restaurant Process (CRP)

Intuitive analogy for HDP:
Imagine a restaurant with infinite tables (topics).
Customer 1 sits at the first table.
Customer 2 can sit at Table 1 (popularity) or start a new Table 2.
Customer n: Sits at an occupied table with probability proportional to the number of people already there, or starts a new table with probability proportional to α.

Result: A few large tables (dominant topics) and many small tables
(rare topics).
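
The seating rule can be simulated in a few lines. In this sketch, customer n joins existing table k with probability n_k / (n − 1 + α), where n_k is that table's current size, and opens a new table with probability α / (n − 1 + α). The function name and the parameter values are assumptions for illustration; this is the basic CRP, not the full two-level HDP.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one by one and return the final table sizes."""
    rng = random.Random(seed)
    table_sizes = []
    for _ in range(n_customers):
        # Existing tables are weighted by their size, a new table by alpha
        weights = table_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)        # open a new table (new topic)
        else:
            table_sizes[choice] += 1     # join an existing table
    return table_sizes

sizes = chinese_restaurant_process(n_customers=100, alpha=1.0)
```

A larger α makes new tables more likely, so the number of topics grows with both the amount of data and the concentration parameter.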



HDP Implementation: Setup

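Using the Climate Change and Conservation corpus (Chapter 7, Ex 3).

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel
    corpus = [
        "Climate change impacts global weather patterns.",
        "Renewable energy sources reduce carbon emissions.",
        "Biodiversity is essential for ecosystem balance.",
        "Conservation efforts protect endangered species."
    ]
    # Build the Bag-of-Words inputs, as in the LDA setup
    texts = [doc.lower().split() for doc in corpus]
    dictionary = Dictionary(texts)
    corpus_bow = [dictionary.doc2bow(text) for text in texts]
    # Note: No 'num_topics' parameter needed!
    hdp_model = HdpModel(corpus=corpus_bow, id2word=dictionary)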



HDP Results: Inferred Topics

The Logic: HDP automatically clustered the vocabulary into two coherent themes without prior instruction.
Inferred Topic 0 (Conservation):
Words: species, conservation, protect, efforts
Context: Documents 3 and 4.
Inferred Topic 1 (Renewable Energy):
Words: emissions, carbon, reduce, energy
Context: Documents 1 and 2.



Real Example: Unknown Dataset Exploration
Scenario: You scrape 10,000 tweets about "The Environment". You have no idea what people are discussing.

HDP in Action
The Process:
1 Customer 1 (Word "Carbon"): Sits at Table A.
2 Customer 2 (Word "Emissions"): Sees "Carbon" at Table A, sits there (Topic 1: Energy).
3 Customer 3 (Word "Species"): Sees Table A, but "Species" doesn't fit well with "Carbon". Starts Table B (Topic 2: Biology).
Result: HDP automatically discovers that the conversation has 2 distinct sub-themes (Energy vs. Biology) without you telling it to look for 2 topics.



HDP: Pros and Cons

Advantages
Nonparametric: No need to tune K.
Flexible: Adapts to complexity of data.

Limitations
Complexity: Much harder to implement and understand
mathematically.
Performance: Generally slower than LDA.
Interpretability: Can sometimes produce too many "micro-topics" or noise.



lda2vec: The Hybrid Approach

Conventional topic models (LDA) give interpretable topics but sparse representations.
Word Embeddings (word2vec) give dense, powerful vector representations but are hard to interpret (black box).
lda2vec combines the best of both worlds.
Developed by Chris Moody (2016).



lda2vec: Concept

Core Idea: Combining the interpretability of LDA with the power of word2vec.

Problem: LDA uses sparse "Bag-of-Words" (ignores context). Word2vec uses dense vectors (hard to interpret).
Solution: lda2vec learns dense word vectors and dense topic vectors simultaneously.
The Equation:

Prediction = f(Word Vector + Document Vector)



Real Example: Contextual Meaning (Airline)
Scenario: Analyzing customer complaints where context changes the meaning of words.

lda2vec in Action
Word: "Boarding"
Case A (Topic: Service):
Document Vector (Complaint about delay) + Word Vector ("Boarding") ≈ "Chaos / Slow".
Case B (Topic: IT Support):
Document Vector (App review) + Word Vector ("Boarding") ≈ "Mobile Pass / Glitch".
Outcome: lda2vec understands that "Boarding" has a negative sentiment in Case A but is a technical feature in Case B.



Review: word2vec (Skip-gram)
A neural network tries to predict context words given a target
word.
w(t) → Predict → w(t − 2), w(t − 1), w(t + 1), w(t + 2).
Learns dense vectors where similar words are close in space.
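
To show what the skip-gram network is trained on, the sketch below generates the (target, context) pairs for a token sequence. The function name `skipgram_pairs` and the example sentence are assumptions for illustration; the neural network itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for t, target in enumerate(tokens):
        start = max(0, t - window)
        stop = min(len(tokens), t + window + 1)
        for c in range(start, stop):
            if c != t:  # a word is not its own context
                pairs.append((target, tokens[c]))
    return pairs

# window=1 keeps the example small; the slide's diagram uses window=2
pairs = skipgram_pairs(["deep", "learning", "is", "fun"], window=1)
```

Each pair becomes one training example: the model is asked to assign high probability to the context word given the target word's vector.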



lda2vec Architecture

lda2vec modifies the skip-gram prediction:
Instead of predicting context using only the word vector:

Prediction = f(Word Vector)

It uses a combined context vector:

Prediction = f(Word Vector + Document Vector)

This allows the model to understand that the word "Apple" means something different in a "Tech" document vs a "Food" document.



Decomposing the Document Vector

The Document Vector is not learned as a black box. It is decomposed into:
1 Document Weight Vector: How much of each topic is in this
document (percentages).
2 Topic Matrix: The dense vector representation of each topic.

Doc Vector = Doc Weights × Topic Matrix
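
The decomposition is a weighted sum of topic vectors, which can be sketched as below. All numbers are made up for illustration, the weights are assumed to be already normalized to proportions, and the function names are hypothetical.

```python
def document_vector(doc_weights, topic_matrix):
    """Weighted sum of topic vectors: Doc Weights x Topic Matrix."""
    dim = len(topic_matrix[0])
    return [sum(weight * topic[i] for weight, topic in zip(doc_weights, topic_matrix))
            for i in range(dim)]

def context_vector(word_vec, doc_vec):
    """Combined vector used for prediction: Word Vector + Document Vector."""
    return [a + b for a, b in zip(word_vec, doc_vec)]

doc_weights = [0.75, 0.25]              # 75% topic 0, 25% topic 1
topic_matrix = [[1.0, 0.0, 2.0],        # dense vector for topic 0
                [0.0, 4.0, 0.0]]        # dense vector for topic 1
doc_vec = document_vector(doc_weights, topic_matrix)   # [0.75, 1.0, 1.5]
word_vec = [0.25, 0.0, -0.5]            # hypothetical word embedding
ctx = context_vector(word_vec, doc_vec)                # [1.0, 1.0, 1.0]
```

Because the document vector is built only from topic vectors, reading off `doc_weights` tells us which topics a document is about, which is what keeps the representation interpretable.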



lda2vec Computation Graph
Allows us to visualize words, documents, and topics in the
same vector space.
We can perform math like: Topic1 + "King" − "Man" = ?



lda2vec: Pros and Cons

Advantages
Dense Vectors: Can be used in Deep Learning pipelines
(RNNs, Transformers).
Interpretable: Unlike standard Doc2Vec, the document
vectors are sparse mixtures of interpretable topics.

Limitations
Complexity: Harder to train than LDA.
Hyperparameters: Requires tuning both deep learning
params and topic modeling params.



Comparison Summary

Model   | Method               | Pros                         | Cons
LSA     | Linear Algebra (SVD) | Fast, Handles Synonymy       | Linearity, Negative values
PLSA    | Probabilistic        | Probabilistic interpretation | Overfitting, No new docs
LDA     | Bayesian (Dirichlet) | Generalizable, Sparse        | Must pick K
HDP     | Nonparametric        | Infers K automatically       | Complex, Slower
lda2vec | Deep Learning        | Dense vectors + Topics       | Training complexity



Final Takeaways

Topic modeling is essential for organizing unstructured text data.
Start with LDA: It is the industry standard, robust, and easy to use via Gensim.
Use LSA for simple baselines or visualization.
Use HDP if you have absolutely no idea how many topics exist.
Use lda2vec if you need to feed document representations into downstream neural networks.



Resources & Code

The complete Python implementation for LSA, LDA, and HDP is available in this Google Colab notebook:

https://[Link]/drive/1Zxj8CtMW67ZyolZHEhiqrRBidBPJr44U
