Understanding Topic Models in Machine Learning

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text.

Uploaded by

ritesh

Topic Models

Special course in Unsupervised Machine Learning

University of Helsinki

Guest Lecture
Ali Faisal
[Link]@[Link]
16th May, 2017

Data Scientist
OpusCapita Group Oy
Motivation
● Topic models have several applications
– Ranging from NLP to biology
● Example applications
– Language modelling
● text categorization,
● innovative search engines,
● speech recognition, etc.
– Demo: Evolution of topics in 40 years of the Signs archive
– Evolution of game genres
– Scientific topic evolution in PNAS
– Semantic hierarchies, etc.
Today's journey - Agenda
● Background of Topic models
● Language Models
● Latent Dirichlet Allocation (LDA)
–Semantic interpretability case
● Demo
● Nonparametric Topic models
–Information retrieval case
● Demo
Background - Example Topics

• Most often used in analyzing text and image collections

• All data is assumed to be generated from a collection of topics
– Topics are sources that generate elements with certain probabilities

Top 5 topics for NIPS conference article collection from 1987-99 (Faisal et al, '12)

Background
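● Model text as dice rolling:
D = “A child plays and learns while playing”

Generates each word independently

$p(D) = p(w_{d,1}, w_{d,2}, \ldots, w_{d,N}) = \prod_n p(w_{d,n})$

If $p(\text{word}) = \phi_{\text{word}}$, then

$p(D \mid \phi) = p(\text{A})\, p(\text{child})\, p(\text{play})\, p(\text{and})\, p(\text{learns})\, p(\text{while})\, p(\text{play})$
$= p(\text{A})\, p(\text{child})\, p(\text{play})^2\, p(\text{and})\, p(\text{learns})\, p(\text{while})$
$= \phi_{\text{A}}\, \phi_{\text{child}}\, \phi_{\text{play}}^2\, \phi_{\text{and}}\, \phi_{\text{learns}}\, \phi_{\text{while}}$
$= \prod_j \phi_j^{n_j} \propto \mathrm{Multinomial}(n \mid \phi)$

where $n_j$ is the number of times word j occurs in D (the multinomial coefficient is dropped because it does not depend on $\phi$).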
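The dice-rolling likelihood above can be checked numerically. The following is a minimal sketch; the word probabilities in `phi` are made-up values for illustration (they are not from the lecture), and the sentence is reduced to the stems used on the slide.

```python
import math
from collections import Counter


def unigram_log_likelihood(tokens, phi):
    """Return log p(D | phi) = sum_j n_j * log(phi_j) for a list of tokens."""
    counts = Counter(tokens)  # n_j: how often word j occurs in D
    return sum(n * math.log(phi[word]) for word, n in counts.items())


# The example sentence with "plays"/"playing" mapped to the stem "play".
tokens = ["a", "child", "play", "and", "learns", "while", "play"]

# Hypothetical word probabilities; they sum to 1 over this toy vocabulary.
phi = {"a": 0.2, "child": 0.1, "play": 0.3, "and": 0.2, "learns": 0.1, "while": 0.1}

log_lik = unigram_log_likelihood(tokens, phi)
likelihood = math.exp(log_lik)  # equals 0.2 * 0.1 * 0.3**2 * 0.2 * 0.1 * 0.1
```

Working in log space avoids numerical underflow once documents contain more than a handful of words.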
Language models
● Unigram Model

$p(D) = \prod_n p(w_{d,n})$

One likelihood for the entire text collection

● Mixture of Unigrams
For each document d, choose a topic k (i.e. $z_d = k$) and generate all words of the document from the word distribution of the chosen topic

$p(D) = \sum_{z_d} p(z_d) \prod_n p(w_{d,n} \mid z_d)$

One topic per document
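As a rough illustration of the mixture-of-unigrams likelihood, the sketch below evaluates log p(D) for one document from its word counts. The topic prior and the topic-wise word distributions are invented toy numbers, and the function name is hypothetical.

```python
import numpy as np


def mixture_of_unigrams_log_lik(counts, topic_prior, topic_word):
    """log p(D) = log sum_k p(z_d = k) * prod_j topic_word[k, j] ** counts[j]."""
    # Per-topic log joint: log p(z_d = k) + sum_j n_j * log p(w_j | z_d = k)
    log_joint = np.log(topic_prior) + counts @ np.log(topic_word).T
    # Log-sum-exp over topics, stabilised by subtracting the maximum.
    m = log_joint.max()
    return m + np.log(np.exp(log_joint - m).sum())


counts = np.array([2, 1, 0])            # n_j over a 3-word toy vocabulary
topic_prior = np.array([0.5, 0.5])      # p(z_d = k), assumed values
topic_word = np.array([[0.7, 0.2, 0.1],  # p(w | z_d = 0), assumed values
                       [0.1, 0.3, 0.6]])  # p(w | z_d = 1), assumed values

log_lik = mixture_of_unigrams_log_lik(counts, topic_prior, topic_word)
```

Because the whole document shares a single topic, the sum over topics sits outside the product over words; this is the restriction that LDA relaxes.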

Bayesian Modelling
● All methods we are discussing can be interpreted as performing [Link]. (Unigrams, Mixture of Unigrams, PCA, ICA or FA) or Bayesian estimation (LDA) in a probabilistic generative model
Latent Dirichlet Allocation (LDA)
Generative process
● Draw the topic distribution of document d:
– $\pi_d \sim \mathrm{Dirichlet}(\alpha)$
● Generate the n-th word by
– drawing a topic index:
● $z_{d,n} \sim \mathrm{Multinomial}(\pi_d)$
– drawing the word from the topic-wise word distribution β:
● $w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$
● where $\beta_k$ are the probabilities of each word “w” in the k-th topic
– $\beta_k \sim \mathrm{Dirichlet}(\eta)$

$\beta_k$ lives on a simplex
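The generative process can be simulated directly. This is a small sketch with NumPy, not an inference algorithm; the function name, the corpus sizes, and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np


def sample_lda_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, eta, seed=0):
    """Sample word-index documents from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # beta_k ~ Dirichlet(eta): one word distribution per topic
    beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs, topic_dists = [], []
    for _ in range(n_docs):
        # pi_d ~ Dirichlet(alpha): topic distribution of this document
        pi_d = rng.dirichlet(np.full(n_topics, alpha))
        # z_{d,n} ~ Multinomial(pi_d): one topic index per word position
        z = rng.choice(n_topics, size=doc_len, p=pi_d)
        # w_{d,n} ~ Multinomial(beta_{z_{d,n}}): word drawn from its topic
        w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
        docs.append(w)
        topic_dists.append(pi_d)
    return docs, np.array(topic_dists), beta


docs, pi, beta = sample_lda_corpus(
    n_docs=5, doc_len=20, n_topics=3, vocab_size=10, alpha=0.5, eta=0.1
)
```

Unlike the mixture of unigrams, each word position gets its own topic index, so a single document can mix several topics.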
Dirichlet distribution - background
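Dirichlet(α) is a multivariate probability distribution over a simplex
PDF: $p(\pi \mid \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i \pi_i^{\alpha_i - 1}$

Induces a uniform topic distribution (uniform density over the simplex) when $\alpha_i = 1$

Induces sparse topics when $\alpha_i < 1$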
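The effect of the concentration parameter is easy to see by sampling. The sketch below compares symmetric Dirichlet draws with all parameters equal to 1 and all equal to 0.1; the dimension `K` and the sample count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of components (illustrative)

uniform_like = rng.dirichlet(np.full(K, 1.0), size=1000)  # alpha_i = 1
sparse = rng.dirichlet(np.full(K, 0.1), size=1000)        # alpha_i < 1

# Average weight of the single largest component in each sample.
# Sparse draws put most of their mass on one or two components,
# so this average is much closer to 1 for the small-alpha case.
mean_max_uniform = uniform_like.max(axis=1).mean()
mean_max_sparse = sparse.max(axis=1).mean()
```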
Latent Dirichlet Allocation (LDA)
The Dirichlet prior mitigates overfitting: words that do not appear in the training set are still assigned some probability of appearing in future documents.
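The smoothing effect of a Dirichlet prior can be illustrated on a single word distribution. The sketch below is a simplification (one multinomial with observed counts, not full LDA inference): under a symmetric Dirichlet(η) prior the posterior mean of word j is (n_j + η) / (N + Vη), where V is the vocabulary size. The counts and η are made-up values.

```python
import numpy as np


def posterior_mean_word_probs(counts, eta):
    """Posterior mean of a multinomial under a symmetric Dirichlet(eta) prior."""
    counts = np.asarray(counts, dtype=float)
    return (counts + eta) / (counts.sum() + eta * counts.size)


counts = [3, 1, 0, 0]                      # two words never seen in training
mle = np.array(counts) / sum(counts)       # maximum likelihood: unseen words get 0
smoothed = posterior_mean_word_probs(counts, eta=0.5)  # unseen words get 0.5 / 6
```

With the maximum likelihood estimate, any future document containing an unseen word would have probability zero; the smoothed estimate avoids that.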
Factor Analysis - revisited
Demo
Semantic interpretation – case study
Explosion of data in Genomic databases
Keeping research cumulative is a huge challenge for current data-driven science

Growth of EBI ArrayExpress database

How to make maximal use of progressively expanding databases?

Retrieval of Relevant Samples
Recipe for retrieval

A background model for the biology provides p(exp)

The retrieval engine finds experiments that share activated biological processes.
NeRV 2D visualizations of the entire collection (color-coded with topics)
Problem: Model selection (how to fix the number of components)

Annealed importance sampler for 1000 iterations over the ArrayExpress database (~7000 samples)
Systematic retrieval evaluation & Comparison

• Retrieval evaluation shows comparable performance with alternatives.

• Here the gold standard is more refined (Experimental Factor Ontology) and represents relationships between experimental factors.

Discounted Cumulative Gain (DCG): how much an investigator gains when a comparison with a particular relevance is found at a particular rank in the result list of a query.
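To make the discounting concrete, here is a small sketch of DCG using the common logarithmic discount, DCG = Σ_i rel_i / log2(i + 1). This particular discount is an assumption; the cited study may use a different variant, and the relevance grades below are invented.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


ranked = [3, 2, 0, 1]   # hypothetical relevance of results at ranks 1..4
score = dcg(ranked)
normalised = ndcg(ranked)
```

A relevant result found lower in the list contributes less, which matches the intuition of an investigator who reads results from the top.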

LDA and REx are our model-based approaches

Caldas et al., Bioinformatics, 2012

Nonparametric Topic models
Unlike LDA, in nonparametric models we no longer need to pre-specify the number of topics “k”

Nonparametric models find the number of topics “k” automatically from the data by utilizing the Dirichlet process (DP) prior and the Hierarchical Dirichlet Process (HDP) prior
DP Mixture models
A Dirichlet Process (DP) is a distribution over distributions, which can be seen as an infinite-dimensional generalization of the usual Dirichlet distribution:
$G \sim \mathrm{DP}(\alpha_0, G_0)$
G is a random probability distribution.
G0 is a base measure, a putative mean for G.
α0 is a concentration parameter; it controls the amount of variability around G0.
In a mixture model (MM), each data item $x_i$ is associated with an underlying factor $\theta_i$ with prior given by G:
$x_i \mid \theta_i \sim F(\theta_i), \qquad \theta_i \mid G \sim G$
DP – Stick breaking construction
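Sethuraman's (1994) stick-breaking construction shows that a sample $G \sim \mathrm{DP}(\alpha_0, G_0)$ has the form:
$G = \sum_{k=1}^{\infty} \pi_k \delta_{\Phi_k}, \qquad \Phi_k \sim G_0$
where $\pi_k \ge 0$, $\sum_{k=1}^{\infty} \pi_k = 1$ are random variables depending upon α0:
$\pi_k = \pi'_k \prod_{j=1}^{k-1} (1 - \pi'_j), \qquad \pi'_k \sim \mathrm{Beta}(1, \alpha_0)$

Intuitively, consider a stick of length one; at each step we break off a part of what remains. The broken part $\pi_k$ is taken as the weight of the corresponding atom in the DP.
(Figure: a unit stick whose first two pieces have lengths $\pi'_1$ and $\pi'_2 (1 - \pi'_1)$.)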
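The stick-breaking weights can be simulated with a finite truncation. The sketch below is an approximation under stated assumptions: the truncation level of 50, the value of α0, and the standard normal base measure G0 are all illustrative choices, and the function name is hypothetical.

```python
import numpy as np


def stick_breaking_weights(alpha0, truncation, rng):
    """First `truncation` stick-breaking weights pi_k of a DP(alpha0, G0)."""
    # pi'_k ~ Beta(1, alpha0): fraction broken off the remaining stick
    v = rng.beta(1.0, alpha0, size=truncation)
    # prod_{j<k} (1 - pi'_j): length of stick left before the k-th break
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining


rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha0=2.0, truncation=50, rng=rng)
atoms = rng.normal(size=50)        # Phi_k ~ G0, here G0 = N(0, 1) by assumption
leftover = 1.0 - weights.sum()     # mass of the atoms beyond the truncation
```

A smaller α0 breaks off larger pieces early, so a few atoms carry most of the mass; a larger α0 spreads the mass over many atoms.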
DP Mixture model (cont'd)

$\theta_i \sim G = \sum_{k=1}^{\infty} \pi_k \delta_{\Phi_k}$

Associate each $\Phi_k$ with a component

So a DP mixture is a mixture model with a potentially infinite number of components
DP Mixture model Grouped data
Modeling grouped data with a DP MM:
Associate a DP with each group.
Each group can learn the appropriate number of components automatically.

There is a problem...
Each group is modeled independently.
Different groups will never share the same components if G0 is continuous.
Individual atoms are not shared.
HDP Mixture model Grouped data

The prior on the individual groups can be made discrete by placing a DP on the base distribution G0:
$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_j \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0)$

The prior induced on G0 and Gj is called an HDP prior, while the model induced on the data is an HDP MM.
Stick breaking construction
The factors $\theta_{ji}$ take on the values $\Phi_k$ with probability $\pi_{jk}$.

This is denoted by the indicator variable $z_{ji}$.
The DP priors have the form:
$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\Phi_k}$
$G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\Phi_k}$

So the HDP MM is simply a mixture model where the mixing weights $\pi_{jk}$ are dependent on each other via $\beta_k$.
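How the group-level weights depend on the shared global weights can be sketched with a finite truncation. This is an approximation under assumptions, not the exact infinite construction: the global weights are truncated and renormalised, and each group's weights are then drawn as Dirichlet(α0 · β), a standard finite-dimensional stand-in. The function name and all parameter values are illustrative.

```python
import numpy as np


def truncated_hdp_weights(gamma, alpha0, n_groups, truncation, rng):
    """Global weights beta and per-group weights pi_j under a truncated HDP."""
    # Global stick-breaking weights beta_k for G0 ~ DP(gamma, H)
    v = rng.beta(1.0, gamma, size=truncation)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()  # fold the truncated tail back in (approximation)
    # Group weights share the same atoms; their mean is beta
    pi = rng.dirichlet(alpha0 * beta, size=n_groups)
    return beta, pi


rng = np.random.default_rng(0)
beta, pi = truncated_hdp_weights(gamma=1.0, alpha0=5.0, n_groups=3,
                                 truncation=8, rng=rng)
```

Every group places its mass on the same set of atoms, which is exactly the sharing that independent DPs with a continuous base measure cannot provide.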
HDP MM vs LDA

(Figure panels: HDP MM, LDA)
Comparison LDA (cont'd)
Perplexity over a held-out set, on a dataset of ~6000 biology abstracts
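For reference, held-out perplexity is the exponential of the negative average per-word log-likelihood, so lower is better. The sketch below uses a uniform toy model as a sanity check; the vocabulary size and word count are arbitrary.

```python
import math


def perplexity(log_probs):
    """exp of the negative mean per-word log-likelihood (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))


# Sanity check: a model that is uniform over V words has perplexity V.
V = 50
uniform_log_probs = [math.log(1.0 / V)] * 200   # 200 held-out words
ppl = perplexity(uniform_log_probs)
```

In a topic model, each entry of `log_probs` would be the model's predictive log-probability of one held-out word.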
Traditional approaches:
Our model vs Multitask HDPLDA
Low training data: transfer learning

HDP Multi-task: sharing and strengths of topics are coupled
IBP-gamma Multi-task: sharing and strengths of topics are decoupled

Faisal et al, Neurocomputing 2013

Transfer learning: a model for a set of samples

➢ Humans use earlier knowledge of related tasks to perform new tasks, e.g. knowledge about standing helps with walking and running.

➢ Transfer learning transfers knowledge from earlier tasks (datasets) to a new one, and multi-task learning learns several tasks together from their respective datasets, exploiting their underlying relationships.

Faisal et al, Neurocomputing, 2013

Transfer learning: Advantages
● It is a model for datasets
● Number of topics can be estimated automatically
● It is capable of modelling weak topics in multi-task problems

● It is a robust and flexible Bayesian generative model

● Outperforms the state-of-the-art HDP topic model, especially when the number of samples is low, a scenario central to multi-task problems
Non-parametric modeling - conclusions
Avoids model selection and thus saves computation time.
The complexity is comparable to that of a parametric model.
Good choice for count data
Information retrieval: making biology cumulative
Objectives

To make data-driven biology more cumulative
How to achieve that!!
Efficiently decompose a transcriptomics dataset into earlier datasets.

Retrieve a set of earlier datasets where each explains a certain part of the variation in the query dataset.
Scalable Supermodels

Definition:
We consider several data sources $D_i$ and their collection $D = \{D_1, D_2, \ldots, D_I\}$. If we compute a model $M_i$ for each dataset, then the model for the complete data collection is $M = f(M_i, \theta_i)$.

There are at least two different distributional assumptions on the whole data:
• Datasets come from the same distribution.
• Datasets come from different distributions.
Model - Trivial
● In the trivial case we assume semi-independent models
● We approximate the joint probability of the query dataset by a combination of previously obtained probability distributions
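One simple way to picture "a combination of previously obtained probability distributions" is a convex mixture with weights fitted by EM. The sketch below is only an illustration under that assumption and is not the method from the lecture: the earlier models are fixed toy categorical distributions, and the function name and data are hypothetical.

```python
import numpy as np


def fit_mixture_weights(lik, n_iter=200):
    """EM for w in p(x_n) = sum_m w[m] * lik[n, m], with w on the simplex."""
    n_items, n_models = lik.shape
    w = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iter):
        # E-step: responsibility of each earlier model for each query item
        resp = lik * w
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities
        w = resp.mean(axis=0)
    return w


rng = np.random.default_rng(0)
# Toy "earlier models": three fixed categorical distributions over 4 symbols.
models = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])
true_w = np.array([0.6, 0.4, 0.0])           # the third model is irrelevant
x = rng.choice(4, size=2000, p=true_w @ models)  # simulated query dataset
lik = models[:, x].T                         # lik[n, m] = p_m(x_n)
w = fit_mixture_weights(lik)
```

The fitted weights indicate which earlier models explain the query data, and an irrelevant model ends up with a weight near zero.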
Example Model: Latent Dirichlet Allocation
A generative model for count data, e.g. text
Characteristics of the trivial model
● Simple and straightforward:
– We decompose the query model into earlier models using a trivial supermodel: a model for models that reduces to successive Bayesian hierarchical learning if the query decomposition constraints are removed.

● Constraints on the model:

– The datasets should share a library of latent components;
– The model is useful when
● there exists a global library of topics
– If there is an in-house collection of background datasets, we can easily build this.
● or prior knowledge is sufficient to be used as latent projections or components.
Model - Nontrivial
● Here we assume completely independent component models having different latent spaces
● Generalize and do not assume an existing model for the query

● Compute the posterior probability of the approximation weights, assuming that our approximation family is correct

● Optimization scheme to estimate W: two-stage convex relaxation to the L0 or L1 norm
Faisal et al, PloS ONE, 2014
Are the most cited datasets the most important?
Compare the importance of each dataset with the number of times it has been cited, and examine the correlation between the two.

Characterize the importance by the weighted out-degree of a dataset, where the weight is provided by our method.

● Analyze whether there are significant and obvious differences in the four corners of the scatter plot using:
– Impact factor of publication venue
– H-index of last author
– Size of the dataset
● Results 1: Upper half of scatter plot:
– Significantly lower impact factor for the left blue block:
– Avg IF 6.6532 vs 21.9674 (pval = 0.00020)
● Results 2: Lower half of scatter plot:
– Significantly higher IF and h-index for the right red block:
– Avg IF 21.93 vs 4.5 (pval = 0.0129)
– H-index 54.25 vs 21.80 (pval = 0.0053)
Faisal et al, PloS ONE, 2014
Inconsistent annotations in the public databases

● The arrow tails represent the original positions of datasets based on the original records in GEO and EBI ArrayExpress
● The heads point to the newly corrected positions suggested by our model.
A generic approach to making research cumulative

Faisal et al, PloS ONE, 2014

Demo
TAKE HOME.....

● A powerful unsupervised machine learning approach

● Can summarize and interpret a huge collection of documents

● Allows us to study evolution of topics over time

● Allows us to study citation patterns, effectively pointing to outliers

– e.g. the supermodel can point out which datasets should cite which, or cases where a citation does not make sense.

● Nonparametric extensions automatically infer the number of topics from the data

● Can handle polysemy and synonymy

How does brain work!

Contact
[Link]@[Link]
References
Most results are taken from my articles, available here; if an article's full text is not available, you can get it from me via email.

David Blei's tutorials and lectures are recommended

[Link]
