Topic Models
Special course in Unsupervised Machine Learning
University of Helsinki
Guest Lecture
Ali Faisal
[Link]@[Link]
16th May, 2017
Data Scientist
OpusCapita Group Oy
Motivation
● Topic Models has several applications
– Ranging from NLP to Biology
● Example applications
– Language modelling
● text categorization,
● Innovative search engines
speech recognition,.. etc
●
– Demo: Evolution of topic in 40
years of Signs archive
– Evolution of game genre
– Scientific topic evolution PNAS
– Semantic hierarchies
etc etc...
Today's journey - Agenda
● Background of Topic models
● Language Models
● Latent Dirichlet Allocation (LDA)
–Semantic interpretability case
● Demo
● Nonparametric Topic models
–Information retrieval case
● Demo
Background -Example Topics
• Most often used in analyzing text and image collections
• All data is assumed to be generated from a collection of Topics
– Topics are sources that generate elements with certain probabilities
Top 5 topics for NIPS conference article collection from 1987-99
Faisal et al, '12
Background
● Model text as dice rolling:
D = “A child plays and learns while playing
Generates each word independently
p(D) = p(w , w ,.., w ) = ∏ p(w )
d,1 d,2 d,N n d,n
if, p(word) = φword
p(D|φ ) = p(A) p(child) p(play) p(and) p(learns) p(while) p(play)
= p(A) p(child) p(play)2 p(and) p(learns) p(while)
= φA φchild φplay2 φand φleans φwhile
= ∏j φj nj= Multinomial(φ)
Language models
● Unigram Model
p(D) = ∏ p(w )
n d,n
● Mixture of Unigrams
For each document d, choose a topic k (i.e. zd = k)
and generate all words of the document
from the word distribution of the chosen topic
p(D) = ∑ p(zd) ∏ p(w | zd)
z n n
Language models
● Unigram Model
p(D) = ∏ p(w )
n d,n
One likelihood for the entire text collection
● Mixture of Unigrams
For each document d, choose a topic k (i.e. zd = k)
and generate all words of the document
from the word distribution of the chosen topic
p(D) = ∑ p(zd) ∏ p(w | zd)
z n n
One topic per document
Bayesian Modelling
●
All methods we are discussing can be interpreted as performing
[Link]. (Unigrams, Mixture of Unigrams, PCA, ICA or FA) or
Bayesian estimation (LDA) in a probabilistic generative model
Latent Dirichlet Allocation (LDA)
Generative process
● Draw topic distribution:
– π d ∼ Dirichlet(α),
● Generate n-th word by
– draw topic index:
● z d,n ∼ Multinomial(π d )
– Draw word from topic-wise word distribution: β
● w d,n ∼ Multinomial(β z d,n )
● Where where β k are probabilities of each word “w” in the k-th topic
– β k ∼ Dirichlet(η)
β k lives on a simplex
Dirichlet distribution - background
Dirichlet(α) where is a multivariate probability distribution over a simplex
PDF:
Induces uniform Topic
distribution when αi = 1
Induces Sparse Topics
when alpha < 1
Latent Dirichlet Allocation (LDA)
Generative process
● Draw topic distribution:
– π d ∼ Dirichlet(α),
● Generate n-th word by
– draw topic index:
● z d,n ∼ Multinomial(π d )
– Draw word from topic-wise word distribution: β
● w d,n ∼ Multinomial(β z d,n )
● Where where β k are probabilities of each word “w” in the k-th topic
– β k ∼ Dirichlet(η)
β k lives on a simplex
Dirichlet prior mitigate over fitting, words that do
not appear in the training set are still assigned
some probability to appear in future documents
Factor Analysis - revisited
Demo
Semantic interpretation – case study
Explosion of data in Genomic databases
Keeping research cumulative is a huge challenge for current data-
driven science
Growth of EBI ArrayExpress database
How to make maximal use of progressively expanding databases?
Retrieval of Relevant Samples
Recipe for retrieval
A background model for the biology provides p(exp)
The retrieval engine finds experiments that share
activated biological processes:
NerV 2D Visualizations of the entire
collection (color-coded with topics)
Problem: Model selection (How to fix the
number of components)
Annealed importance sampler for 1000 iterations over the ArrayExpress
database (~7000 samples)
Systematic retrieval evaluation & Comparison
• Retrieval evaluation shows comparable performance with alternatives.
• Here the gold standard is more refined (Experimental factor ontology) and
represents relationships between experimental factors.
Discounted Cumulative Gain:
How much an investigator gains
when a comparison with particular
relevance is found at particular
rank in the result list of query.
LDA and REx are our model based
approaches
Caldas et al., Bioinformatics, 2012
Nonparametric Topic models
unlike LDA, in nonparametric models we no longer need to pre-specify “k”
Nonparametric find the number of topics “k” automatically from data by
utilizing the amazing Dirichlet process (DP) prior and Hierarchical Dirichlet
Process (HDP) prior
DP Mixture models
A Dirichlet Process (DP) is a distribution over distributions
which can be seen as an infinite dimensional
generalization of the usual Dirichlet distribution
G~DP(α0,G0)
G is a random probability distribution.
G0 is a base measure – a putative mean for G
α0 Is a concentration parameter, controls the amount of variability
around G.
In a Mixture models (MM) each data item xi is associated
with an underlying factor θi with prior given by G:
xi |θi ~ F(θi) θi|G ~ G
25
DP – Stick breaking construction
Sethuraman's (1994) stick breaking construction shows that samples
G~DP(α0,G0) has the form:
∞
G= ∑ k k
G
k =1
G0
≥0, ∑ k= 1
∞
where k k =1 are random
variables depending upon α0
πk = π'k ∏j=1 to k-1 (1 - π'j)
π'k = Beta(1,α0)
Intuitively, consider a stick of length one, at each point we break the
stick. The broken part “πk” is taken as the weight of the corresponding
atom in DP
26
π'1 π'2(1-π'1)
DP Mixture model (cont'd)
i ~ G= ∑ k k
k =1
Associate each Фk with a component
So a DP mixture is a mixture model
with potentially infinite no. of components
27
DP Mixture model Grouped data
Modeling grouped data with a DP
MM.
Associate a DP with each group
Each group can learn the appropriate
number of components automatically.
There is a problem...
28
DP Mixture model Grouped data
Modeling grouped data with a DP
MM.
Associate a DP with each group
Each group can learn the appropriate
number of components automatically.
There is a problem...
Each group is modeled independently
Different groups will never share the same
components if G0 is continuous
Individual atoms are not shared
29
HDP Mixture model Grouped data
The prior on the individual groups can be made
discrete by placing a DP on base distribution G0
G0 ~ DP(γ,H)
The prior induced on G0 and Gj is called a HDP
prior while the model induced on data is a HDP
MM.
30
Stick breaking construction
The factors θji take on values Фk with
probability πjk
This is denoted by indicator
variable zji
The DP priors have the form:
∞
G0= ∑ k k
k=1
∞
G j= ∑ jk k
k=1
So The HDP MM. Is simply a mixture
model where the mixing weights πjk are
dependent on each other via βk.
31
HDP MM vs LDA
HDP MM LDA 32
Comparison LDA (cont'd)
Perplexity over held-out set a dataset of ~6000
biology abstracts
33
Traditional approaches:
Our model vs Multitask HDPLDA
Traditional approaches:
Our model vs Multitask HDPLDA
Low training data: transfer learning
Sharing and strengths of Sharing and strengths of
topics are coupled topics are decoupled
HDP Multi-task IBP-gama Multi-task
Faisal et al, Neurocomputing 2013
Transfer Learning; a model for set of samples
➢ Humans use earlier knowledge of related tasks to perform new
tasks, e.g knowldge about standing helps walking and running.
➢ Transfer learning transfers knowledge from earlier tasks (data-
sets) to a new one and Multi-task learning learn several tasks
together from their respective data sets, exploiting their
underlying relationships.
Faisal et al, Neurocomputing, 2013
Transfer learning: Advantages
● It is a model for data-sets
● Number of topics can be estimated automatically
● It is capable to model weak topic in multi-task problems
● It is a robust and flexible Bayesian generative model
● Outperforms state of the art HDP topic model, specially when the
number of samples is low; scenario central to the multi-task problems
Non-parametric modeling -
conclusions
Avoids model selection and thus saves computation time....
The complexity is comparable to parametric model.
Good choice for count data
39
Information retrieval
making
Biology Cumulative
Objectives
To make data-driven biology more
cumulative
How to achieve that!!
Efficiently decompose a transcriptomics dataset into
earlier datasets.
Retrieve a set of earlier datasets where each explains a
certain part of variation in the query dataset.
How to achieve that!!
Efficiently decompose a transcriptomics dataset into
earlier datasets.
Retrieve a set of earlier datasets where each explains a
certain part of variation in the query dataset.
Scalable Supermodels
Definition:
We consider several data sources, Di and then their
collection D = { D1, D2, .... DI }. If we compute models for
each dataset, Mi then the model for complete data collection
is: M = f(Mi , θi).
There are at-least two different distributional
assumptions on the whole data.
• Datasets come from a same distribution.
• Datasets come from different distributions.
Model - Trivial
● In the trivial case we assume semi-independent
models
● We approximate joint probability of the query
dataset by a combination of previously obtained
probability distributions
Example Model:
Latent Dirichlet Allocation
A generative model for count data e.g text
46
Characteristics of the trivial model
● Simple and Straightforward:
– We decompose the query model into earlier models using a trivial
supermodel; a model for models that reduces to successive
Bayesian hierarchical learning if the query decomposition
constraints are removed.
● Constraints on the model:
– The dataset should share a library of latent components;
– The model is useful when
● there exists a global library of topics
– If there is an in-house collection of background datasets we can easily build this.
● or prior knowledge is sufficient to be used as latent projections or
components.
Model - Nontrivial
● Here we assume completely independent component models having different latent spaces
● Generalize and do not assume an existing model for the query
● Compute the posterior probability of approximation weights assuming that our approximation family
is correct:
● Optimization scheme to estimate W - two stage convex relaxation to the L-0 or L-1 norm
Faisal et al, PloS ONE, 2014
Are the most cited datasets, most important?
Compare correlation between the importance of each dataset with re-
spect to the no. of times it has been cited.
Characterize the importance by the weighted out degree of a dataset; where the weight is
provided by our method.
● Analyze if there are significant and obvious
differences in the four corners of the scatter plot
using:
– Impact factor of publication venue
– H-index of last author
– Size of the dataset
● Results 1: Upper half of scatter plot:
– Significantly lower impact factor for the left
blue block:
– Avg IF 6.6532 vs 21.9674 pval = 0.00020.
● Results 2: Lower half of scatter plot:
– Significantly higher IF and h-index for right
red block:
– Avg IF values 21.93 vs 4.5 (pval: 0.0129)
– H-index 54.25 vs 21.80 (pval: 0.0053)
Faisal et al, PloS ONE, 2014
Inconsistent annotations in the public databases
●
the arrow tails represent original position of datasets based on original
records in GEO and EBI ArrayExpress
● the head points to newly corrected positions as suggested by our model.
A generic to making research cumulative
Faisal et al, PloS ONE, 2014
Demo
TAKE HOME.....
● A powerful unsupervised machine learning approach
● Can summarize and interpret a huge collection of documents
● Allows us to study evolution of topics over time
● Allows us to study citation patterns, effectively pointing to outliers
– e.g. the supermodel can point out datasets who should cite
whom, or ones where it does not make sense to cite.
–
● Nonparametric extensions automatically infer the correct number
of topics
● Can handle polysemy and synonymy
How does brain work!
Contact
[Link]@[Link]
References
Most results are taken from my articles, available here if article
full text is not available then you can get it from me via email
David Blei's tutorials and lectures are recommended
[Link]