Supervised Hierarchical Dirichlet Process
Abstract—We propose the supervised hierarchical Dirichlet process (sHDP), a nonparametric generative model for the joint
distribution of a group of observations and a response variable directly associated with that whole group. We compare the sHDP with
another leading method for regression on grouped data, the supervised latent Dirichlet allocation (sLDA) model. We evaluate our
method on two real-world classification problems and two real-world regression problems. Bayesian nonparametric regression models
based on the Dirichlet process, such as the Dirichlet process-generalised linear models (DP-GLM), have previously been explored;
these models allow flexibility in modelling nonlinear relationships. However, until now, hierarchical Dirichlet process (HDP) mixtures
have not seen significant use in supervised problems with grouped data since a straightforward application of the HDP on the grouped
data results in learnt clusters that are not predictive of the responses. The sHDP solves this problem by allowing for clusters to be learnt
jointly from the group structure and from the label assigned to each group.
Index Terms—Bayesian nonparametrics, hierarchical Dirichlet process, latent Dirichlet allocation, topic modelling
1 INTRODUCTION
around terms that are predictive of document responses. For example, in sentiment analysis tasks, the topics learnt consist of terms that cause the document to have positive or negative sentiment. Similarly, for financial news, the topics consist of terms that have positive or negative effects in the market. In contrast, unsupervised LDA learns topics that are in line with the general theme of the documents, but are often unrelated to the document responses. Blei and McAuliffe [5] found that the predictions made by sLDA for the responses of an unseen test set were better than the predictions made using the unsupervised topics inferred by LDA. However, they found that the sLDA model only performed slightly better than LASSO regression on the empirical distribution of words for each document.
Although supervised topic models perform well, they are limited as the number of topics in the model must be fixed in advance. This can lead to overfitting in sLDA when there are too many topics and regression parameters in the model, so that topics are relatively specific and do not generalise well to unseen observations. Underfitting is the opposite case, when there are too few topics and regression parameters in the model, so that unrelated observations are assigned together to the same topic. Another characteristic is the relative contribution of the supervised and unsupervised components to the model. In a fixed parametric setting, one or the other of these signals (typically the unsupervised part) may dominate the likelihood, which determines the topic features. In a nonparametric setting, any dominant individual signal is captured by a set of components, leaving the remaining joint topic and supervised signal to be captured by as many additional components as are relevant.
A number of methods can be used to choose the number of topics, including cross-validation and model comparison techniques; however, these are slow as the algorithm has to be restarted a number of times, and choosing the ideal number of topics from the runs can be difficult. Bayesian nonparametric methods have emerged as a good way to extend these models naturally to handle a flexible number of topics.
The nonparametric supervised HDP (sHDP) model is presented in this paper. The sHDP model is a generative supervised model that has an infinite number of topics (or clusters) that can be used to predict a document response. The sHDP model is a nonparametric extension of the supervised topic model (sLDA) [5]. The main contribution of the model is that it overcomes the issue of choosing the fixed number of topics that is necessary for sLDA. The fact that the model has an infinite number of topics also reduces the problems of underfitting and overfitting. The sHDP can also be considered a supervised extension of the HDP mixture model described in Section 3.1. In this paper, we show that sHDP performs better than sLDA on one data set and comparably to sLDA with the best-performing number of topics (chosen post hoc) on two of the three other data sets (see, e.g., Fig. 2).
The rest of the paper is organised as follows. Section 2 sets the problem and the form of the data for the models proposed in this paper. Then in Section 3, we briefly review some existing work on tackling the supervised learning problem with nonparametric models and also approaches specifically for grouped data, and we go on in Section 3.2 to give an introduction to generalised linear models (GLMs), and review the sLDA model (Section 3.3), both of which are important in the later parts of the paper. We then introduce the supervised HDP model in Section 4. Section 5 describes the inference algorithms that are used to sample from the posterior of the new model. Finally, Section 6 covers experiments with this model on real-world data sets consisting of both binary and continuous responses and compares the new model to existing models.
2 PROBLEM DESCRIPTION
In this section we outline the structure of the problems for which this work is relevant. First, we assume that there is a set of data points divided into $D$ groups. Second, to reduce complexity, we assume that a bag-of-words representation can be used for each group, which amounts to assuming exchangeability among the observations within a group. Each group $i$ consists of both a variable number of data points $x_{ij}$, $j = 1, \ldots, N_i$, which are the predictors, and a single response $y_i$. Given a set of training examples with predictors and associated responses, the task is to predict the responses on a separate test set of predictors. In the case of document modelling, $D$ is the number of documents in the corpus and each word uses a one-of-$V$ encoding $x_{ij} \in \{1, \ldots, V\}$, where $V$ is the size of the vocabulary of the corpus. $y_i$ is the response for the document, such as a rating or a category. In the rest of this paper, the problem and models will be described in terms of documents and words, but all the models can also be used on other kinds of grouped data.
3 BACKGROUND
In this section we outline previous work and other methods that will be used in this paper. Many of these approaches utilise Bayesian nonparametric models to gain more flexibility than parametric models.
Due to their flexibility, there has been interest in supervised nonparametric models, such as the regression models of Gaussian processes (GPs) [6] and Bayesian regression trees. Dirichlet processes have also been adapted for supervised problems. An example of this is the Dirichlet process multinomial logit model (dpMNL) [7]. In this generative model, the relationship between the covariates and responses is modelled jointly using Dirichlet process mixtures. Although within each cluster the relationship is assumed to be linear, an overall nonlinear relationship occurs when the model has more than one cluster. A multinomial logit is used to model the responses conditionally on the covariates within each cluster. Thus, the regression parameters of the logit model are different for each cluster. The predicted responses are conditional on the parameters and the covariates. The dpMNL model was tested on protein fold classification, and compared with existing methods based on neural networks and support vector machines. The results showed that the dpMNL model performed significantly better.
The dpMNL has been extended to model additional response types with DP mixtures of generalised linear models (DP-GLM) [8]. Whereas the dpMNL only explicitly
Authorized licensed use limited to: University of Warwick. Downloaded on June 05,2024 at [Link] UTC from IEEE Xplore. Restrictions apply.
DAI AND STORKEY: THE SUPERVISED HIERARCHICAL DIRICHLET PROCESS 245
models discrete responses, the DP-GLM can generatively model both continuous and discrete responses using different generalised linear models. Again, the regression coefficients of the generalised linear models are different for each cluster. Priors are also placed on the coefficients, resulting in a regularised model for the response. The model was shown to have weak consistency by Hannah et al. [8], and the performance was shown to be comparable to a Gaussian process model.
Neither the dpMNL nor the DP-GLM has, to our knowledge, been applied to the problem of predicting the responses of groups of observations. The supervised topic model (sLDA) is one approach to tackling this prediction problem for grouped data (e.g., documents). sLDA learns topics that are able to model the document responses more accurately. The sLDA model has, however, limited flexibility since the number of latent topics must be fixed in advance, leaving it at risk of overfitting or underfitting. There has also been work on other methods of learning the regression coefficients or other response types, such as DMR [9], MedLDA [10] and labeled LDA [11]; however, these models still have a fixed number of topics.
Hierarchical Dirichlet process (HDP) mixture models, described in Section 3.1, are a type of Bayesian nonparametric model that can be used instead of LDA for topic modelling. They are commonly used as the nonparametric analogue to LDA, allowing for flexible topic modelling without being restricted to a fixed number of topics. Though inference is more complex, Gibbs sampling and variational Bayes techniques can still be applied. Until now, HDP mixtures have not seen significant use in supervised problems and suffer the same problems as unsupervised LDA in that the topics learnt are not necessarily predictive of the responses. The sHDP model we present in this paper extends the HDP mixture model to learn topics that are good predictors of document responses.
3.1 Hierarchical DPs
A DP [1], [12] is a stochastic process that can be thought of as a probability distribution on the space of probability measures. The name of the process accurately describes the fact that the DP results in finite-dimensional Dirichlet marginal distributions, similar to the Gaussian process that has Gaussian distributed finite-dimensional marginal distributions. DPs are commonly used as a prior on the space of probability measures, which gives wider support and so improved flexibility over using traditional parametric families as priors. In addition, DPs have tractable posteriors, making them important in Bayesian nonparametric problems. A DP is defined in terms of a base measure and a concentration parameter. Each draw from the DP is itself a measure. Since there is a positive probability of drawing a previously drawn value, the draws are discrete with probability 1. This makes them very useful for clustering in DP mixtures.
The HDP [13] is a hierarchical extension to DPs. The hierarchical structure provides an elegant way of sharing parameters. This process defines a set of probability measures $G_i$ for $D$ pre-specified groups of data and a global probability measure $G_0$. The global measure is distributed as
$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \tag{1}$$
where $H$ is the base probability measure and $\gamma$ is the concentration parameter.
The random measures for each group $i$ are conditionally independent given the global measure
$$G_i \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0), \tag{2}$$
where $\alpha_0$ is a concentration parameter. The distribution $G_0$ varies around $H$ by an amount controlled by $\gamma$ and the distribution $G_i$ in group $i$ varies around $G_0$ by an amount controlled by $\alpha_0$. This can be seen as adding another level of smoothing on top of DP mixture models. Let $\theta_{i1}, \theta_{i2}, \ldots$ be i.i.d. variables distributed according to $G_i$, where each of these variables is a parameter that corresponds to an observation $x_{ij}$, with the likelihood of these observations being
$$\theta_{ij} \mid G_i \sim G_i, \tag{3}$$
$$x_{ij} \mid \theta_{ij} \sim F(\theta_{ij}), \tag{4}$$
where $F(\theta_{ij})$ is the distribution of $x_{ij}$ given $\theta_{ij}$. This prior results in a DP being associated with each group in the model where the DPs are conditionally independent given their parent and the parameters drawn in the parent node are shared among the descendant groups. This structure can be extended to multiple levels.
The HDP requires that the data be in a pre-defined nested structure. The HDP model has been used in information retrieval tasks and used in relation with traditional TF-IDF measures [14] for measuring the score of documents in relation to a query. There are variants of HDP that model topics for documents where there is no predefined hierarchical structure (see, e.g., [15]).
3.1.1 Similarity to LDA
With the appropriate base measure, the HDP can be thought of as the infinite analogue of LDA. In the HDP, the base probability measure allows for a countably infinite number of multinomial draws and so an infinite number of topics. This allows the number of topics to grow or shrink according to the data. This solves the problem of finding the best number of topics in LDA and reduces the problems of overfitting or underfitting due to a fixed number of topics.
3.2 Generalised Linear Models
Often when a response is not an unconstrained continuous variable, it is transformed into one and a normal linear model is used for it. However, this may not always be appropriate. A generalised linear model [16] expands the flexibility of linear regression by being capable of analysing data where either there may not be a linear relation between the covariates $x$ and the response $y$ or where a Gaussian assumption for $y$ is inappropriate. Given parameters $\eta$ and covariates $x$, a generalised linear model is specified by a linear predictor, which we denote in this section by $\rho = \eta^\top x$, a link function $g(\cdot)$ that relates the linear predictor to the mean $\mu$ of the response, $\mu = g^{-1}(\rho)$, and a probability distribution from the exponential family that gives the distribution of the response $y$ with mean $E(y \mid \cdot) = \mu$. In this paper, we only consider canonical link functions though others can
246 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 37, NO. 2, FEBRUARY 2015
be used when needed. The canonical link function is a choice of link function such that $\rho$ is the natural parameter in the exponential family distribution. The distribution of the response may also be an exponential dispersion family that has an additional dispersion parameter denoted as $\delta$. We denote this as $\mathrm{ExpFam}(\mu, \delta)$. The generalised linear model for response $y$ takes the form
$$p(y \mid \rho, \delta) = h(y, \delta) \exp\left(\frac{\rho y - A(\rho)}{\delta}\right), \tag{5}$$
where $A(\rho)$ is the log-normaliser.
Different forms of responses can be modelled using different choices of $h$ and $A$. In particular, there is a Gaussian distribution on $y$,
$$p(y \mid \rho, \delta) = \frac{1}{\sqrt{2\pi\delta}} \exp\left(-\frac{1}{2\delta}(y - \rho)^2\right), \tag{6}$$
when $h(y, \delta) = (1/\sqrt{2\pi\delta})\, e^{-y^2/(2\delta)}$ and $A(\rho) = \rho^2/2$. This is a normal linear model with a mean of $\rho$ and variance of $\delta$.
When $y$ is binary, a binomial distribution can be used with the number of trials $n = 1$, so that $y$ is distributed as
$$p(y \mid \mu) = \mu^{y}(1 - \mu)^{1 - y}, \tag{7}$$
where $\mu = g^{-1}(\rho)$ is the mean of the response. This uses the canonical logit link function $g(\mu) = \ln(\mu/(1 - \mu))$ and the binomial distribution for $y$. This choice of distribution and link function results in a logistic regression model.
coefficients. Exchangeability of the topic assignments implies posterior parameter symmetries in the GLM model, were a full Bayesian solution obtained. However, if we are constrained to a MAP inferential setting there is no possibility of parameter symmetry and so there is symmetry breaking of the exchangeability of topic assignments. The inference process used must be sensitive to this broken symmetry. An inference process that considers the generation process for the document contents first enables consistent topic labels to be determined. Then the document’s response is chosen conditional on those contents, and hence on topic labels that have a consistent meaning. An alternative to this is to choose a model where $y$ is regressed on the topic proportions for the document, $\vartheta$. However, this may result in some topics being estimated that just explain the response variables while other topics only explain the document words.
In the sLDA model, the parameters $\alpha$, $\phi_{1:K}$, $\eta$ and $\delta$ are treated as constants to be estimated. Approximate maximum-likelihood estimation is then performed with a variational expectation-maximisation (EM) method, similar to that for LDA. Collapsed Gibbs sampling can also be used for inferring the topics jointly as in LDA.
The models we propose in this paper solve the issue sLDA has of requiring the number of topics to be fixed from the start. This can result in overfitting or underfitting if the number of topics is unsuitable for the data set. Though the number of topics can be chosen based on a training set, the process can be difficult and time consuming.
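To make the response models of Section 3.2 concrete, the following Python sketch evaluates the exponential dispersion form (5) for the Gaussian case (6) and for a binary response with the canonical logit link. It is a minimal illustration, not the implementation used in the experiments. The function names and the inputs in the usage example are hypothetical, and the Bernoulli log-normaliser $A(\rho) = \log(1 + e^{\rho})$ with unit dispersion is the standard choice, assumed here because it is not written out in the text.

```python
import math


def linear_predictor(eta, x):
    """rho = eta^T x for a coefficient list eta and a covariate list x."""
    return sum(e * xi for e, xi in zip(eta, x))


def gaussian_log_density(y, rho, delta):
    """Log of Eq. (6), written in the exponential dispersion form of Eq. (5)."""
    log_h = -0.5 * math.log(2.0 * math.pi * delta) - y * y / (2.0 * delta)
    log_normaliser = 0.5 * rho * rho  # A(rho) = rho^2 / 2
    return log_h + (rho * y - log_normaliser) / delta


def softplus(rho):
    """Numerically stable log(1 + exp(rho))."""
    return max(rho, 0.0) + math.log1p(math.exp(-abs(rho)))


def bernoulli_log_prob(y, rho):
    """Log-probability of y in {0, 1} under the canonical logit link.

    Assumption: h(y) = 1, A(rho) = log(1 + exp(rho)) and dispersion 1, so the
    mean is mu = 1 / (1 + exp(-rho)) and p(y) = mu^y (1 - mu)^(1 - y).
    """
    return rho * y - softplus(rho)


if __name__ == "__main__":
    # Hypothetical coefficients and covariates, for illustration only.
    rho = linear_predictor([0.5, -1.0], [2.0, 0.25])
    print(gaussian_log_density(1.0, rho, delta=2.0))
    print(math.exp(bernoulli_log_prob(1, rho)))
```

Working with the natural parameter $\rho$ directly keeps both response types in the same form, which is what allows a single linear predictor to be reused across different choices of GLM.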
$$G_0 \sim \mathrm{DP}(\gamma H) \tag{8}$$
$$G_i \sim \mathrm{DP}(\alpha G_0) \tag{9}$$
$$\theta_{ij} = (\theta^X_{ij}, \theta^Y_{ij}) \sim G_i \tag{10}$$
$$x_{ij} \mid \theta^X_{ij} \sim f(\theta^X_{ij}) \tag{11}$$
$$y_i \mid \theta^Y_i \sim \mathrm{ExpFam}(g^{-1}(\theta^Y_i), \delta), \tag{12}$$
where $\theta^Y_i = (1/N_i) \sum_j \theta^Y_{ij}$ is the linear predictor for the GLM, $g$ is its link function and $\delta$ is the dispersion parameter for the exponential family distribution. $i$ ranges over each document, $j$ ranges over each observation in that document, $\gamma$ denotes the concentration parameter for the corpus-level DP and $\alpha$ denotes the concentration parameter for the document-level DP. The base measure $H = H^Y \times H^X$ consists of a measure for the regression parameters $\theta^Y \sim H^Y$ and another measure for the topic parameters $\theta^X \sim H^X$. $G_0$ is the corpus-level random measure that acts as the base measure for the document-level random measure $G_i$.
Due to the clustering property of the DP, some data points will share the same parameters $\theta$, which can be represented as those data points being assigned to the same topic. The prior density for the regression parameters is typically $H^Y = N(0, \zeta)$. For topic modelling, the documents consist of words, and the prior density for the cluster parameters is $H^X = \mathrm{Dirichlet}(\alpha_w)$, where $\alpha_w$ is the parameter for a symmetric Dirichlet distribution. $f$ is the likelihood of $\theta^X$ given the observations $x$. In a topic modelling problem, $f(\theta^X) = \mathrm{Multinomial}(\cdot \mid \theta^X)$. When coupled with its conjugate prior, the Dirichlet distribution, the topic parameters $\theta^X$ can be integrated out, allowing for collapsed Gibbs inference to be performed by just keeping track of the word to topic allocations and the regression coefficients for the topics. The GLM model for the responses allows the responses to be continuous, ordinal, categorical and other types depending on the form of the GLM. If the base measure for the coefficients $H^Y$ is chosen to be Gaussian, the maximum a posteriori (MAP) solution for the coefficients is similar to the solution for L2 penalised regression. A graphical model is shown in Fig. 1.
Fig. 1. The supervised HDP model where the observed variables are the words $w_{ij}$ denoting word $j$ in document $i$ and the document label $y_i$.
The generative process for the full model is:
1) Draw (from their prior distributions) the concentration parameters, $\gamma$, for the global DPs. Likewise, draw the concentration parameters, $\alpha$, for the lower-level DPs from their priors.
2) Draw a global distribution over topics and their regression coefficients $T_0 \sim \mathrm{DP}(\gamma, H)$.
3) Now for each document $i$,
   a) Draw a distribution over topics $T_i \sim \mathrm{DP}(\alpha, T_0)$.
   b) For each word $w_{ij}$,
      i) Draw a topic $(\theta^X_{ij}, \theta^Y_{ij}) \sim T_i$.
      ii) Draw a word $w \sim \mathrm{Multinomial}(\theta^X_{ij})$.
   c) Draw a response for the document $y \sim \mathrm{ExpFam}(g^{-1}(\theta^Y_i), \delta)$ where $\theta^Y_i = (1/N_i) \sum_j \theta^Y_{ij}$.
The sHDP learns topics that both model document contents well and are predictive of document responses without the need to choose a fixed number of topics beforehand. This structured approach to supervision allows the model to be easily extended to incorporate additional information from documents to aid in predicting the response such as the authors of a document or the research group which authored a document, which can be inferred through the grouped author-topic model [17]. For example, there is the problem of predicting the venue where a paper is published by learning the venues where the research group has previously published. Another example could be the problem of predicting a set of keywords or categories for a paper by learning which categories have previously been picked by the research group for those topics. Allowing for the topics to be supervised can also give more control over the types of topics that are learnt by the sHDP in case the unsupervised topics are not interesting for a particular task. Finally, the sHDP model allows for unlabelled data to be used as part of the training set in semi-supervised problems. This allows supervised topics to be learnt that take into account the content of unlabelled documents so that the topics can better model the entire corpus instead of just the labelled documents.
5 INFERENCE
Since posterior inference is intractable in DP-based models, approximations must be used. Collapsed Gibbs sampling is
the most common technique used to sample from the posteriors of these models, and it can also be applied to the model described in this paper. For topic modelling problems, the Dirichlet base measure for each topic, which is a distribution over the vocabulary, is conjugate to the multinomial likelihood for the words. This enables the topic parameters to be integrated out. Thus at each iteration and based on the Chinese restaurant process, collapsed Gibbs sampling can be used to sample the topic allocations. The regression coefficients can then be sampled from their posteriors in some cases, such as for a Gaussian response, and can be approximated in other cases. The following sections describe inference in the proposed sHDP model.
5.1 The sHDP Model
Since the base measure for the topic regression coefficients will not in general be conjugate to the response model, the non-conjugate auxiliary variable sampling algorithm (Algorithm 8) described by Neal [18] is used to sample the topic allocations $z_{ij}$, where $z_{ij} = k$ indicates that word $w_{ij}$ is allocated to topic $k$. The main difference from inference for the HDP mixture model is in sampling the topic allocation variable and the topic regression coefficients. The conditional distribution for the topic allocation has an additional term for the conditional likelihood of the topic parameters given the document response. Gibbs sampling proceeds as below:
1) For each document $i$,
   a) Let $n_{ik}$ denote the number of words in document $i$ allocated to topic $k$, and let a superscript $-ij$ for a variable denote the terms excluding the $ij$th term. For each word $w_{ij}$, sample the topic allocation $z_{ij}$ using
$$p(z_{ij} = k \mid z^{-ij}, w_{ij}, \beta) \propto \begin{cases} (n_{ik}^{-ij} + \alpha\beta_k)\, f_k^{-w_{ij}}(w_{ij})\, p(y_i \mid z^{-ij}, z_{ij} = k, \eta), & \text{if } k = z_{i'j'} \text{ for some } (i', j') \neq (i, j), \\ \alpha\beta_{\mathrm{new}}\, f_{\mathrm{new}}(w_{ij})\, p(y_i \mid z^{-ij}, z_{ij} = k^{\mathrm{new}}, \eta^{\mathrm{new}}), & \text{if } k = k^{\mathrm{new}}, \end{cases} \tag{13}$$
      where $\eta^{\mathrm{new}} = (\eta, \eta_k^{\mathrm{new}})$, $\eta_k^{\mathrm{new}} \sim N(0, \zeta)$, $f_k$ is the distribution of the word given the other words allocated to topic $k$ and $f_{\mathrm{new}}$ is the probability of the word in an empty topic.
      If a new topic $k^{\mathrm{new}}$ is sampled during one of the steps above, then draw $b \sim \mathrm{Beta}(1, \gamma)$, set the new weight $\beta_{k^{\mathrm{new}}} = b\beta_{\mathrm{new}}$ and set the new $\beta_{\mathrm{new}}$ to $(1 - b)\beta_{\mathrm{new}}$. The value $b$ corresponds to the weight of the new atom that is instantiated from the Dirichlet process. Also, set $\eta$ to the value of $\eta^{\mathrm{new}}$.
   b) Sample $m_{ik}$, where $k$ ranges over the topics, by generating $n_{ik}$ uniformly distributed random variables $u_1, \ldots, u_{n_{ik}}$ between 0 and 1 and setting
$$m_{ik} = \sum_{m=1}^{n_{ik}} \mathbb{1}\left[u_m \le \frac{\tau\beta_k}{\tau\beta_k + m}\right], \tag{14}$$
      where $\mathbb{1}$ is the indicator function.
2) Sample $\beta$ from
$$(\beta_1, \ldots, \beta_K, \beta_{\mathrm{new}}) \sim \mathrm{Dirichlet}(m_1, \ldots, m_K, \gamma). \tag{15}$$
For a continuous response assuming $g = 1$,
$$p(y_d \mid z, \eta) \propto \exp\left(-(y_d - \eta^\top z)^2\right), \tag{16}$$
and for a binomial response where $y_d \in \{0, 1\}$,
$$p(y_d \mid z, \eta) = (\eta^\top z)^{y_d}(1 - \eta^\top z)^{1 - y_d}. \tag{17}$$
During prediction, the posterior of $z$ is needed over the test documents. This is calculated by removing the terms that depend on the response $y$ from the conditional distributions, so that inference on the test documents is identical to that for the unsupervised HDP. The posterior for the test samples can be sampled by replacing (13) with
$$p(z_{ij} = k \mid z^{-ij}, w_{ij}, \beta) \propto \begin{cases} (n_{ik}^{-ij} + \alpha\beta_k)\, f_k^{-w_{ij}}(w_{ij}), & \text{if } k = z_{i'j'} \text{ for some } (i', j') \neq (i, j), \\ \alpha\beta_{\mathrm{new}}\, f_{\mathrm{new}}(w_{ij}), & \text{if } k = k^{\mathrm{new}}, \end{cases} \tag{18}$$
and sampling the allocations and counts for the test documents.
5.2 Parameter Posteriors and Prediction
The topic regression coefficients are sampled after each round of sampling the topic assignments. We also performed experiments where the topic assignments were sampled for several rounds in between sampling the regression coefficients but this made little difference to prediction performance. The topic coefficients can be updated for sHDP by regressing only on the topics that are allocated to at least one observation. We will describe cases for a Gaussian and binary response in this section, though other models for the response can be used too.
5.2.1 Gaussian Model
In the Gaussian model, we place a Gaussian prior on the regression coefficients. The model response can be rewritten as
$$y = X\eta + c, \tag{19}$$
where $y$ is a length-$D$ vector of document responses, $X$ is a $D \times \infty$ matrix of cluster to document allocation counts, $\eta$ is a vector of regression parameters for each topic and $c$ are the residuals. Let $X$ be the matrix where row $d$ is the empirical topic distribution for document $d$. Since only a finite number of topics have non-zero counts in the corpus, the columns in $X$ that have zero counts and their corresponding $\eta$ entries can be ignored, making posterior computation tractable.
The posterior distribution for the parameters $\eta$ is then a Gaussian distribution:
$$\eta \sim N\left((X^\top X + \zeta I)^{-1} X^\top y,\; (X^\top X + \zeta I)^{-1}\right), \tag{20}$$
where $\zeta$ is the prior variance for the regression coefficients and $I$ denotes the identity matrix.
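The draw in (20) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used for the experiments. The helper names (`cholesky`, `posterior_mean_and_factor`, `sample_eta`) and the toy inputs are hypothetical. The response noise is assumed to have unit variance, $\zeta$ enters as the ridge term exactly as written in (20), and $X$ is assumed to have already been restricted to the columns of topics with non-zero counts. A Cholesky factor of $X^\top X + \zeta I$ is used both to solve for the posterior mean and to colour standard normal noise.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L x = b."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def solve_lower_transposed(L, b):
    """Back substitution: solve L^T x = b using the lower factor L."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / L[i][i]
    return x


def posterior_mean_and_factor(X, y, zeta):
    """Mean of Eq. (20) and the Cholesky factor of P = X^T X + zeta I."""
    D, K = len(X), len(X[0])
    P = [[sum(X[d][a] * X[d][b] for d in range(D)) + (zeta if a == b else 0.0)
          for b in range(K)] for a in range(K)]
    Xty = [sum(X[d][a] * y[d] for d in range(D)) for a in range(K)]
    L = cholesky(P)
    mean = solve_lower_transposed(L, solve_lower(L, Xty))
    return mean, L


def sample_eta(mean, L, rng):
    """One draw eta ~ N(mean, P^{-1}); L^{-T} e has covariance P^{-1}."""
    e = [rng.gauss(0.0, 1.0) for _ in mean]
    noise = solve_lower_transposed(L, e)
    return [m + n for m, n in zip(mean, noise)]


if __name__ == "__main__":
    # Toy values for illustration only: three documents, two topics.
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1.0, 2.0, 3.0]
    mean, L = posterior_mean_and_factor(X, y, zeta=1.0)
    print(mean)
    print(sample_eta(mean, L, random.Random(0)))
```

Factorising once and reusing the factor avoids forming the inverse explicitly, which matters because the coefficients are redrawn after every round of topic sampling.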
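The topic allocation step can be sketched as follows. The function below returns the unnormalised weights of (13), and of (18) when the response term is dropped, which is the form used for test documents. It is a hedged sketch, not the implementation used for the experiments, and all names are hypothetical. It assumes that $f_k$ is the standard collapsed predictive $(n_{kw} + \alpha_w)/(n_k + V\alpha_w)$ for a symmetric Dirichlet prior, so that $f_{\mathrm{new}} = 1/V$; that the Gaussian response term follows (16) with $z$ read as the empirical topic proportions of the document; and that a single auxiliary coefficient `eta_new` has been drawn from $N(0, \zeta)$ beforehand.

```python
import math
import random


def allocation_weights(n_doc, beta, beta_new, alpha, word_counts, topic_totals,
                       w, vocab_size, alpha_w, response_lik=None):
    """Unnormalised weights of Eq. (13), or of Eq. (18) if response_lik is None.

    All counts exclude the word currently being resampled.
    n_doc[k]        words of this document allocated to topic k
    beta[k]         global weight of topic k (beta_new is the remaining mass)
    word_counts[k]  dict mapping word id -> corpus count in topic k
    topic_totals[k] total corpus count in topic k
    response_lik(k) response term for z_ij = k; k == len(beta) is the new topic
    """
    K = len(beta)
    weights = []
    for k in range(K):
        # Assumed collapsed Dirichlet-multinomial predictive f_k(w).
        f_k = (word_counts[k].get(w, 0) + alpha_w) / (topic_totals[k] + vocab_size * alpha_w)
        wk = (n_doc[k] + alpha * beta[k]) * f_k
        if response_lik is not None:
            wk *= response_lik(k)
        weights.append(wk)
    w_new = alpha * beta_new * (1.0 / vocab_size)  # f_new(w) = 1 / V
    if response_lik is not None:
        w_new *= response_lik(K)
    weights.append(w_new)
    return weights


def gaussian_response_lik(y, n_doc, eta, eta_new, n_words):
    """Return k -> exp(-(y - eta^T zbar)^2) as in Eq. (16), where zbar holds the
    document's topic proportions after tentatively allocating the word to k."""
    def lik(k):
        counts = list(n_doc) + [0]
        counts[k] += 1
        coeffs = list(eta) + [eta_new]
        pred = sum(c * e for c, e in zip(counts, coeffs)) / n_words
        return math.exp(-(y - pred) ** 2)
    return lik


def sample_topic(weights, rng):
    """Draw an index with probability proportional to its weight."""
    u = rng.random() * sum(weights)
    acc = 0.0
    for k, wk in enumerate(weights):
        acc += wk
        if u < acc:
            return k
    return len(weights) - 1
```

Returning the last index from `sample_topic` corresponds to instantiating a new topic, after which the weights $\beta$ would be split as described in Section 5.1.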
For prediction, topics are sampled for test documents as in (18). The empirical topic distribution is sampled over a number of iterations, with any topics that are instantiated or removed during this period ignored. The remaining empirical topic distributions for each document are averaged and used to calculate the expectation of the response.
For the sHDP model, this is calculated as
[...]
$$l(\eta) \propto -\sum_{d=1}^{D} \log\left(1 + \exp(-y_d \eta^\top z_d)\right) - \frac{\zeta}{2} \eta^\top \eta. \tag{22}$$
$$\sigma(x) = \frac{1}{1 + \exp(-x)}. \tag{24}$$
We place a Gaussian prior distribution on the regression coefficients; however, since there is no conjugate prior, the posterior distribution is not available in closed form. To sample from the exact posterior for the coefficients, the Gibbs sampling method presented by Groenewald and Mokgatlhe [19] and used in topic models by Mimno et al. [20] can be used. However, we found that this method took numerous iterations to converge for a given topic assignment due to the high number of coefficients. As a result, in our experiments, we instead sample from an approximation to the posterior. A common approximation to use is the Laplace approximation, which involves sampling from a Gaussian centred at the MAP estimate of the parameters with a covariance matrix that is the inverse of the negative Hessian of the unnormalised log posterior at that estimate. The limited-memory BFGS algorithm can be used to find the MAP estimate of the parameters [21].
For prediction, topics are sampled for test documents as in (18).
For the sHDP model, the distribution of the response is given by
$$p(y_d = 1 \mid z, \eta) \approx \frac{\exp(\eta^\top E[z])}{1 + \exp(\eta^\top E[z])}. \tag{25}$$
For simplicity, we also consider using the MAP estimate of the parameters directly. In many cases we find there is not a significant performance benefit of using parameter sampling over using the MAP solution directly.
A sampling step of the supervised HDP algorithm that samples the regression coefficients (the sampled model) is shown below in pseudocode. To initialise, words are randomly allocated to topics so $z$ is set randomly, $K$ is set to the maximum value of $z$, and $m$ and $\beta$ are sampled from (14) and (15) respectively.
6 EXPERIMENTS
We conducted experiments on four real-world data sets. First, we considered the classification problem of determining the effect of financial newswires on the direction of change of the closing prices of a set of stocks. Second, we focused on the classification problem of determining whether movie review sentences are positive or negative. Third, we addressed the regression problem of predicting a rating for a full movie review and fourth, the regression problem of predicting the popularity of a document. The data sets were preprocessed to keep the terms with the highest total TF-IDF score. TF-IDF is a measure of how important a term is for a document in a corpus. The score is calculated as $\mathrm{tf}(w) \log(D/n_w)$, where $\mathrm{tf}$ is the frequency of the term $w$ in the document, $D$ is the number of documents and $n_w$ is the number of documents where the term $w$ occurs. This is summed across all the documents for each term, and the highest scoring terms are kept.
The newswire classification data set consists of a set of real-world newswires extracted from Reuters about the stocks in the S&P 500 on different days over a year up to May 2011. The newswires were labelled with the companies that were mentioned in the wire. These labels were used so that only newswires about stocks that, on that day, had more than an 8 percent positive change or 3 percent negative change from the previous day were considered. These cut-offs were chosen so that the number of declining stocks was similar to the number of rising ones, and to ignore minor changes of prices due to other factors. This resulted in a data set of 1,518 documents and a vocabulary of 1,895.
The review snippet classification data set [22] consists of reviews from the Rotten Tomatoes website with reviews that were marked as fresh labelled as positive reviews and
reviews that were marked as rotten labelled as negative reviews. The data set contains 5,331 positive snippets with the same number of negative ones and a vocabulary of 4,310.
The review snippet regression data set [22] consists of reviews written by four film critics where the writer additionally assigned a rating to his or her review. The ratings were normalised to be between 0 and 1. Any terms that appeared in more than 25 percent of the documents were removed, as were any terms that appeared fewer than five times. Only the remaining top 2,179 terms by TF-IDF score were then kept. The ratings for each document were preprocessed to normalise the scores by applying a log transform. There was a total of 5,005 documents.
The document popularity regression data set is a data set of submission descriptions from the Digg website with the associated number of votes that each submission received. The number of votes was again normalised by applying a log transform. The data set consisted of a vocabulary of 4,120 across 3,880 documents.
Experiments were performed with the sHDP model and the sLDA model. Both models were implemented using MCMC methods (collapsed Gibbs sampling in the case of sLDA with the Chang [23] implementation) and predictions were done using an equivalent sample in both instances. For sLDA we also applied a variational approach with the Wang [24] implementation, and the results for sLDA are given for both collapsed Gibbs and variational inference approaches. For sLDA, for collapsed Gibbs sampling, 3,000 iterations were used and for variational inference, EM was run until the relative change in the likelihood bound was less than 0.01 percent.
The accuracy for classification problems and predictive $R^2$ for regression problems after five-fold cross-validation were calculated. Predictive $R^2$ is defined as
$$pR^2 = 1 - \frac{\sum_d (\hat{y}_d - y_d)^2}{\sum_d (y_d - \bar{y})^2}, \qquad (26)$$
where $y_d$ are the observed responses, with $d$ ranging over the documents, $\hat{y}_d$ is the response predicted by the model and $\bar{y} = \frac{1}{D} \sum_{d=1}^{D} y_d$ is the mean of the observed responses. This value gives the proportion of variability in the data set that is accounted for by the model and is often used to evaluate the goodness of fit of a model. A value of 1.0 is obtained when the regression line perfectly fits the data. We present accuracy and predictive $R^2$ results that are calculated on the full set of predictions. We also give indicators of the minimum and maximum difference in performance (across the folds) of each method, relative to the sampled sHDP.
In the experiments, the prior standard deviation of the parameters z was tested with three values (1, 5 and 10) on each fold’s training set by splitting the fold’s training set into a smaller training and validation set and choosing the best value on the validation set. This was also done when choosing the prior standard deviation for sLDA. $\alpha_w$ for the sHDP model was set to 0.01 on data sets similar to previous experiments with HDP. In the sHDP, the standard prior Gamma(1, 1) was placed on $\alpha$ and $\gamma$ and these are sampled during inference. For sHDP, learning took place over 2,000 iterations with the coefficients being sampled every iteration. For predicting the responses of the test documents, 500 iterations of topic sampling were used to allow the inferred topics to converge. The number of iterations was chosen by looking at the trace plots of the residuals and the regression coefficients, which appeared to converge by that number of iterations. To compare our models, we carried out experiments using sLDA with variable numbers of topics so that performance with sLDA with the best performing number of topics on the test set can also be compared.
We show results for a sHDP inference algorithm that uses the MAP estimate of the regression coefficients (which reduces computation time) during inference and uses a fixed set of coefficients at test time. We also show results for an algorithm that samples from the posterior of those coefficients during training and test time. For the Gaussian model, we sample from the posterior as in Eq. (20). Using a Gibbs sampling method to sample the coefficients of the binomial model took many iterations to converge so we sampled the coefficients from a Laplace approximation. Finally, we also do experiments with a two-step algorithm in which unsupervised topics for the documents are first learnt as in a HDP model and then a GLM model is trained on top of the learnt topics to predict document labels. In this way, the performance of jointly training the topics and the GLM model can be compared with training the two in separate steps.
6.1 Results
Fig. 2 shows that the supervised HDP model performs significantly better than the sLDA model on the newswire data set. For almost all models, sLDA inference using Gibbs sampling performs better than with variational EM, so we have not shown the variational EM to avoid clutter. sHDP performs competitively against sLDA with the best performing number of topics (as chosen on the test set) on the remaining data sets except for the movie snippet data set. In the movie snippet data set, sLDA outperformed the sHDP across the number of topics. From the results it can be seen that for sLDA, picking the right number of topics is key to getting good performance. Moreover, picking too few topics or too many in some cases can cause big drops in performance. On the other hand, for the sHDP, the model yields good performance without having to pick the number of topics. For sHDP, the results also show that sampling the regression coefficients from their posteriors makes little difference to the results compared to using the MAP value of the coefficients. Additionally, the simple alternative of learning the topics unsupervised in a HDP model and then training a GLM model on top (a two-step supervised approach) performs significantly worse than jointly learning the topics with the sHDP model.
The better performance of sHDP compared to sLDA for the newswire data set and competitive performance with the other data sets is partly due to the increased flexibility of the model and better mixing during inference as can be seen in Fig. 3. The increased flexibility comes from the model having an infinite number of topics to model the documents and responses. The better mixing results from the fact that during inference, clusters can be instantiated or
DAI AND STORKEY: THE SUPERVISED HIERARCHICAL DIRICHLET PROCESS 251
Fig. 2. Results for the test data sets after five-fold cross-validation. Classification results are given for (a) the newswire data set and (b) the movie
snippet data set. Regression results ($R^2$) for the entire data set are given for (c) the movie reviews data set and (d) the document popularity data set.
sLDA Gibbs performance is shown for each number of topics. Variational EM performed as well as or worse than Gibbs sampling and is omitted for
space. For sHDP, the performance with MAP parameters, with parameters sampled from their posteriors and with a two-step supervised approach
where an unsupervised HDP model is learnt and a GLM model trained on top of that is shown. The upper and lower bars show the minimum and
maximum performance of each method relative to the performance of the sampled sHDP (minimum and maximum taken over the five folds). This
allows the reader to see whether a method performs better or worse than sampled sHDP across all the folds.
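The regression panels and the fold-wise bars described in this caption rest on two small computations: the predictive $R^2$ of Eq. (26) and the minimum and maximum per-fold difference of a method relative to a reference method. The following is a minimal sketch of both, not the authors' evaluation code; the function names `predictive_r2` and `relative_range` are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def predictive_r2(y_true, y_pred):
    """Predictive R^2 as in Eq. (26): 1 - sum (yhat_d - y_d)^2 / sum (y_d - ybar)^2."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    y_bar = fmean(y_true)  # mean of the observed responses
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("observed responses have zero variance")
    return 1.0 - ss_res / ss_tot


def relative_range(fold_scores, reference_scores):
    """Minimum and maximum per-fold score difference relative to a reference method."""
    diffs = [s - r for s, r in zip(fold_scores, reference_scores)]
    return min(diffs), max(diffs)


if __name__ == "__main__":
    print(predictive_r2([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
    print(relative_range([0.40, 0.45, 0.38], [0.42, 0.41, 0.40]))
```

Predicting the mean of the observed responses for every document gives a value of 0, and a model that predicts worse than the mean gives a negative value, so the quantity is not bounded below.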
unneeded ones can be removed while sampling. Since newly instantiated clusters are empty, it is easier for words to change topic and be allocated to a new cluster. In contrast, in sLDA each topic almost always has a significant number of words allocated to it, making it difficult for the distribution of a topic to change. This has the effect of smoothing over term contributions for each topic. Thus, the fact that there are more specific topics in the sHDP model helps to improve performance.
From the relatively low accuracy scores and large standard deviations, it can be seen that the labels on the newswire data set are much harder to predict than on the movie review data set. The standard deviations for the newswire scores imply that the data is much noisier since newswires only indirectly influence stock movements. In addition, only closing stock prices are used, which means that it is possible there were changes in the stock price from the general movement of the industry or the market. However, the sHDP is able to pick out these subtle signals whereas sLDA with both types of inference algorithms was unable to.
Fig. 3 is a Gelman-Rubin-Brooks plot [25] which shows how Gelman and Rubin’s shrink factor [26] changes as the number of iterations changes during inference. The plot of the shrink factor is calculated from four parallel MCMC chains from different starting points. We present results for the Gaussian sHDP model and the Gibbs sampled sLDA model with 40 topics. The shrink factor is calculated by comparing the within-chain and between-chain variances for each variable of interest. The factor predicts that the
Fig. 3. Chain convergence for (TOP) the sHDP model and (BOTTOM) the Gibbs sampled sLDA model with 40 topics. Gelman-Rubin-Brooks plots
show how Gelman and Rubin’s shrink factor for the L2 norm of the regression coefficients and the L2 norm of the residuals changes across iterations
during inference for the movie regression data set. This is shown for four parallel MCMC chains with different starting values. Values close to 1 indicate
convergence. From these graphs it can be seen that the Gibbs sampled sLDA model is slower to mix compared to the sHDP model.
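To make the shrink factor in this caption concrete, the sketch below computes the basic form of the statistic for one scalar quantity (for example the $L_2$ norm of the residuals) recorded in several parallel chains, by comparing within-chain and between-chain variances. It is an illustration under stated assumptions rather than the code behind Fig. 3: refinements such as degrees-of-freedom corrections are omitted, and the names `shrink_factor` and `shrink_factor_trace` are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def shrink_factor(chains):
    """Basic potential scale reduction factor for equal-length chains of one scalar."""
    m = len(chains)
    if m < 2:
        raise ValueError("need at least two chains")
    n = len(chains[0])
    if n < 2 or any(len(c) != n for c in chains):
        raise ValueError("chains must have equal length of at least 2")
    chain_means = [fmean(c) for c in chains]
    within = fmean([variance(c) for c in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    if within == 0:
        raise ValueError("zero within-chain variance")
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return sqrt(pooled / within)


def shrink_factor_trace(chains, step):
    """Shrink factor on growing prefixes, i.e. the curve plotted against iterations."""
    n = len(chains[0])
    return [(t, shrink_factor([c[:t] for c in chains]))
            for t in range(step, n + 1, step)]


if __name__ == "__main__":
    a = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
    b = [x + 5.0 for x in a]  # same spread, different level: poorly mixed
    print(shrink_factor([a, a]), shrink_factor([a, b]))
```

Chains that explore the same region give a value near 1, while chains stuck at different levels give a value well above 1, which is the pattern the caption describes for the slower-mixing model.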
chains have converged if the output from the chains is indistinguishable, which is given by the factor approaching 1. In the plots, the shrink factors for the $L_2$ norm of the regression coefficients and the $L_2$ norm of the residuals for the two models are shown. As can be seen from the plots, the shrink factor for the sLDA model is significantly higher than that of the sHDP model, indicating the sHDP model exhibits better mixing.
We also conducted experiments by regressing directly on the empirical word distribution for each document with $L_1$ regularized generalized linear models and the GLMNET R package. The regularization parameter was chosen through cross validation on the training set of each fold. The accuracies are 61 percent for the newswire data set and 75 percent for the movie snippet data set, and the $R^2$ values are 0.44 for the movie rating data set and 0.064 for the document popularity data set. Hence this approach marginally outperformed both sLDA and sHDP on the newswire, movie snippet and movie review data sets, but it is outperformed by the sHDP on the document popularity data set. Given the large number of available parameters, the model flexibility provides benefits for the $L_1$ GLM, but makes it particularly sensitive to particular word usage. It is therefore understandable that this model does best on the three data sets with more coherent word usage patterns, but is less powerful than a sampled sHDP on document popularity, where the word usage within a topic is more varied. The extra topic structure also enables an understanding of the topic-related dependencies which we expand on in the next section.
6.2 Analysis of Strong Topics and Terms
For the sHDP model, the top positive and negative topics, in terms of their regression coefficients and their most frequent terms for the movie review problem, are shown in Table 1. The topics do not generally correspond to themes such as film genre or style. Instead, the topics contain many names such as actors and directors. This is because the flexibility of a nonparametric model means that the top positive and negative topics consist of very few terms and are allocated to actors and directors that are consistently reviewed well or poorly. This flexibility results in strong topics that are grouped around consistently performing actors or directors but the topics are less coherent since they are associated with so few documents. Topics that consist of more terms, even if those are strong terms, generally have smaller regression coefficients since the effect of the different terms is averaged over other words
TABLE 1
The Most Positive and Negative Learnt Topics, in Terms of Their Regression Coefficients, from the Movie Review Regression Data Set with sHDP
TABLE 2
Strongest Topics for the Movie Review Regression Data Set Using the Alternative Two-Step Approach Where a Supervised GLM Model Is Learnt on Top of a Set of Unsupervised Topics from the HDP
7 CONCLUSIONS
We have presented a supervised Bayesian nonparametric model that handles grouped data. Each group of data has an associated response such as sentiment ratings or document popularity. The supervised HDP model learns latent topics that are predictive of document responses without having to choose a fixed number of topics, a deficiency in previous models such as supervised LDA (sLDA). In those models, overfitting or underfitting can occur if the number of topics is unsuitable for the data set. The strongest topics learnt in the sHDP are relatively finer-grained and are associated with fewer topics allowing their effect on the document response to be learnt easily. Regression and classification experiments were performed on real-world data sets and showed that the model performs better than sLDA on the newswire data set, and only does worse than sLDA on the movie snippet classification data set. The
REFERENCES
[1] C. E. Antoniak, “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems,” Ann. Statist., vol. 2, no. 6, pp. 1152–1174, 1974.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[3] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. Nat. Aca. Sci. Am., vol. 101, no. Suppl. 1, pp. 5228–5235, 2004.
[4] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, “The author-topic model for authors and documents,” in Proc. 20th Conf. Uncertainty Artif. Intell., 2004, pp. 487–494.
[5] D. M. Blei and J. D. McAuliffe, “Supervised topic models,” in Advances in Neural Information Processing Systems 20, vol. 20, no. 21, J. C. Platt, D. Koller, Y. Singer, and S. Roweis, Eds., Cambridge, MA, USA: MIT Press, 2007, pp. 1–8.
[6] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[7] B. Shahbaba and R. M. Neal, “Nonlinear models using Dirichlet process mixtures,” J. Mach. Learn. Res., vol. 10, no. 10, pp. 1829–1850, Aug. 2009.
[8] L. Hannah, D. Blei, and W. Powell, “Dirichlet process mixtures of generalized linear models,” J. Mach. Learn. Res., vol. 12, pp. 1923–1953, 2011.
[9] D. Mimno and A. McCallum, “Topic models conditioned on arbitrary features with Dirichlet-multinomial regression,” in Proc. Uncertainty Artif. Intell., 2008, pp. 411–418.
[10] J. Zhu, A. Ahmed, and E. P. Xing, “MedLDA: Maximum margin supervised topic models for regression and classification,” in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 1257–1264.
[11] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in Proc. Conf. Empirical Methods Natural Language Process., 2009, pp. 248–256.
[12] T. S. Ferguson, “A Bayesian analysis of some nonparametric problems,” Ann. Statist., vol. 1, no. 2, pp. 209–230, 1973.
[13] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1566–1581, 2006.
[14] P. J. Cowans, “Information retrieval using hierarchical Dirichlet processes,” in Proc. 27th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2004, pp. 564–565.
[15] W. Li, D. Blei, and A. McCallum, “Nonparametric Bayes pachinko allocation,” in 23rd Conf. Uncertainty Artif. Intell., 2007, pp. 243–250.
[16] P. Mccullagh and J. Nelder, Generalized Linear Models, 2nd ed., series Monographs on Statistics & Applied Probability. London, U.K.: Chapman & Hall/CRC, Aug. 1989.
[17] A. M. Dai and A. J. Storkey, “The grouped author-topic model for unsupervised entity resolution,” in Proc. 21st Int. Conf. Artificial Neural Netw., 2011, no. 6791, pp. 241–249.
[18] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixture models,” J. Comput. Graphical Statist., vol. 9, no. 2, pp. 249–265, 2000.
[19] P. C. Groenewald and L. Mokgatlhe, “Bayesian computation for logistic regression,” Comput. Statist. Data Anal., vol. 48, no. 4, pp. 857–868, 2005.
[20] D. Mimno, H. Wallach, and A. McCallum, “Gibbs sampling for logistic normal topic models with graph-based priors,” in Proc. NIPS Workshop Analyzing Graphs, 2008.
[21] T. P. Minka, “A comparison of numerical optimizers for logistic regression,” Carnegie Mellon University, Tech. Rep., http://[Link]/en-us/um/people/minka/papers/logreg/[Link], 2003.
[22] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” in Proc. 43rd Annu. Meeting Assoc. Comput. Linguistics, 2005, vol. 43, no. 1, pp. 115–124.
[23] J. Chang, “lda: Collapsed Gibbs sampling methods for topic models,” R package version 1.3.2, [Link]package=lda, 2012.
[24] C. Wang, “sLDA EM implementation,” 2009, [Link][Link]/~chongw/slda/ with copyright at [Link]edu/~chongw/slda/[Link]
[25] S. P. Brooks and A. Gelman, “General methods for monitoring convergence of iterative simulations,” J. Comput. Graphical Statist., vol. 7, no. 4, pp. 434–455, Dec. 1998.
[26] A. Gelman and D. B. Rubin, “Inference from iterative simulation using multiple sequences,” Statist. Sci., vol. 7, no. 4, pp. 457–472, 1992.
[27] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, “On smoothing and inference for topic models,” in Proc. 25th Conf. Uncertainty Artif. Intell., 2009, pp. 27–34.
[28] C. Wang, D. M. Blei, and F.-F. Li, “Simultaneous image classification and annotation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 1903–1910.
Andrew M. Dai received the MA degree in computer science from the University of Cambridge, and the PhD degree from the School of Informatics, University of Edinburgh (focusing on Bayesian nonparametric models). He is a software engineer at Google Inc., Mountain View. His research interests include Bayesian nonparametrics, topic modelling, Dirichlet processes, machine learning, and collaborative filtering.
Amos J. Storkey received the MA degree in mathematics, a degree in theoretical physics (Part III Maths) from Trinity, Cambridge. He received the MSc and PhD degrees (Bayesian Gaussian processes and neural networks) from Imperial College, London; both were focused on machine learning. He is a reader in the School of Informatics, University of Edinburgh. Prior to his current post, he was research associate in the School of Informatics and Institute for Astronomy, Edinburgh and a Microsoft Research funded fellowship in Informatics, Edinburgh. His research interests include methods for incentivised distributed machine learning (e.g., Machine Learning Markets), inference and learning in continuous time systems, and applications in imaging and medical informatics. He is an associate editor of the IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI).
The supervised HDP model mitigates overfitting and underfitting by not requiring a predefined number of topics, allowing for the dynamic adjustment of the inferred number based on the data complexity. This Bayesian nonparametric approach facilitates the emergence of the appropriate number of clusters needed for accurate prediction without excessive reliance on training data alone. Moreover, the model's increased mixing ability during inference allows for efficient topic reallocation, helping to capture relevant patterns while discarding unnecessary complexity. These strategies provide more robust handling of diverse data structures and content variations in predictive modeling.
Exchangeability in Bayesian models implies that the order of topics or data points does not affect the probabilistic structure of the model. In practical terms, this ensures that inference does not depend on the sequence of data input. However, when employing maximum a posteriori (MAP) inference instead of a full Bayesian approach, this property can break, resulting in symmetry breaking of parameter inference. This affects the consistency of topic labels and makes the inference sensitive to the approach used. Implementing an inference process that initially considers document content generation ensures more consistent and meaningful topic labels, thereby improving the reliability and interpretability of parameter inference.
The approach of using Generalized Linear Models (GLMs) on top of unsupervised topics differs in predictive accuracy by often underperforming compared to the supervised HDP model. While this two-step approach involves first learning unsupervised topics with a HDP model and then fitting a GLM, it generally yields less accurate predictions due to less integration between data modeling and response prediction. The supervised HDP, by contrast, jointly learns topics aligned with prediction targets, allowing more nuanced understanding and adjustment based on response variables. Experimental results showed that the supervised HDP model outperformed the two-step approach notably in the document popularity data set, indicating higher adaptability and detailed data pattern recognition.
The hierarchical structure in Hierarchical Dirichlet Process (HDP) models allows for a nested arrangement of topics, where each group can have its own slightly perturbed distribution around a common base. This facilitates modeling more complex data hierarchies and adapts dynamically to the emergence of new patterns, making it especially effective in variable-topic applications without preset constraints. In contrast, non-hierarchical models like Latent Dirichlet Allocation (LDA) have a fixed structure that necessitates predefining the number of topics, which can limit flexibility and adaptation. As a result, HDP models can offer more nuanced topic inference and better generalization in diverse datasets but with the trade-off of increased computational complexity.
Supervised variations of topic models, such as supervised LDA (sLDA) and supervised Hierarchical Dirichlet Process (sHDP), tackle the limitations of traditional topic models by learning topics that are relevant for prediction tasks. While LDA reduces dimensionality by identifying general corpus themes, sLDA incorporates document responses in the topic learning process, aligning topics more closely with prediction outcomes. However, sLDA still requires a fixed number of topics. The sHDP goes further by introducing a Bayesian nonparametric approach, allowing for a potentially infinite number of latent clusters and automatically adjusting the number of topics to effectively predict document responses. This flexibility helps mitigate overfitting or underfitting issues that can arise with a predefined topic number in predictive tasks.
The benefits of adopting a hierarchical Bayesian approach, such as the Hierarchical Dirichlet Process (HDP), include the ability to model data flexibly with a potentially infinite number of topics or clusters. This allows for automatic adjustment to the underlying complexity of the data without the need to set a predefined number of topics. It addresses the risk of overfitting and underfitting associated with fixed-structure methods. However, the drawbacks include increased computational complexity and the challenge of setting prior distributions correctly. Additionally, while more flexible, hierarchical models may require more data for accurate inference compared to simpler models.
The supervised Hierarchical Dirichlet Process (sHDP) model improves predictive performance over supervised Latent Dirichlet Allocation (sLDA) by allowing a flexible topic structure which can adapt to the complexity of the data without a fixed number of topics. This flexibility is particularly advantageous in handling noisy data sets since sHDP can instantiate or remove topics as needed, leading to better detection of subtle data patterns and signals. For example, on the newswire data set, which is significantly noisier due to indirect influences on stock movements, the sHDP model can effectively distinguish relevant topics, whereas sLDA struggles due to its static topic structure.
In Generalised Linear Models (GLMs), the canonical link function is the choice of link function that sets the linear predictor as the natural parameter of the exponential family distribution. Its significance lies in the mathematical convenience and interpretability it provides, as it leads to simplified computation of maximum likelihood estimates for the model parameters. For example, using the canonical logit link function in logistic regression ties the log odds of the probability of success linearly to the predictors. This makes the model fitting and interpretation straightforward and efficient.
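The relationship between the logit link and the log odds can be checked numerically. The sketch below is a small illustration, not part of the sHDP model code: `logit`, `inv_logit` and `success_probability` are hypothetical helper names, and the coefficient values in the usage lines are made up for the example.

```python
from math import exp, log


def logit(p):
    """Canonical link for the binomial family: maps a probability to its log odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse link: maps a linear predictor back to a probability (stable form)."""
    if eta >= 0:
        return 1.0 / (1.0 + exp(-eta))
    z = exp(eta)
    return z / (1.0 + z)


def success_probability(x, beta):
    """Probability of success when the log odds equal the linear predictor x . beta."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return inv_logit(eta)


if __name__ == "__main__":
    x = [1.0, 0.3, 0.7]       # illustrative covariates (first entry acts as intercept)
    beta = [-0.5, 2.0, 1.0]   # illustrative coefficients
    p = success_probability(x, beta)
    print(p, logit(p))        # logit(p) recovers the linear predictor
```

Because the link is canonical, applying `logit` to the fitted probability returns the linear predictor itself, which is why each coefficient can be read as a change in log odds per unit change of its predictor.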
The Hierarchical Dirichlet Process (HDP) extends the Latent Dirichlet Allocation (LDA) model by allowing for a countably infinite number of multinomial draws, translating to an infinite number of topics that can grow or shrink according to the data. This solves the problem of determining the best number of topics in LDA and reduces overfitting or underfitting as the number of topics is not fixed but can adapt to the complexity of the data.
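One way to see how the number of clusters can adapt to the amount of data is the Chinese restaurant process, a standard sequential view of a single Dirichlet process. The sketch below simulates it for one group only; it is an illustration of the nonparametric prior and not the HDP or sHDP sampler described in the paper, and the function name `crp_table_sizes` and its parameters are hypothetical.

```python
import random


def crp_table_sizes(n_customers, alpha, seed=0):
    """Simulate a Chinese restaurant process and return the size of each table.

    Customer i (0-based) opens a new table with probability alpha / (i + alpha)
    and otherwise joins an existing table with probability proportional to its size.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    sizes = []
    for i in range(n_customers):
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)  # instantiate a new cluster
            continue
        u -= alpha
        for k, n_k in enumerate(sizes):
            if u < n_k:
                sizes[k] += 1
                break
            u -= n_k
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, len(crp_table_sizes(n, alpha=1.0, seed=1)))
```

The number of occupied tables is not fixed in advance; it is determined by the draws and tends to grow slowly as more observations arrive, with the concentration parameter `alpha` controlling how readily new clusters are opened.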
Generalised Linear Models (GLMs) offer flexibility in statistical modeling by allowing the analysis of data where the response cannot be assumed to be linear in the predictors or normally distributed. GLMs include a linear predictor, a link function relating the predictor to the mean response, and a distribution from the exponential family for the response. This enables their application in scenarios like binary outcomes with logistic regression, in which a binomial distribution and logit link function are used. A Gaussian assumption might be inappropriate in cases like binary outcomes, for instance, predicting yes/no responses in a survey, as the normal distribution cannot capture the discrete nature of the data.