IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 37, NO. 2, FEBRUARY 2015

The Supervised Hierarchical Dirichlet Process

Andrew M. Dai and Amos J. Storkey

Abstract—We propose the supervised hierarchical Dirichlet process (sHDP), a nonparametric generative model for the joint
distribution of a group of observations and a response variable directly associated with that whole group. We compare the sHDP with
another leading method for regression on grouped data, the supervised latent Dirichlet allocation (sLDA) model. We evaluate our
method on two real-world classification problems and two real-world regression problems. Bayesian nonparametric regression models
based on the Dirichlet process, such as the Dirichlet process-generalised linear models (DP-GLM), have previously been explored;
these models allow flexibility in modelling nonlinear relationships. However, until now, hierarchical Dirichlet process (HDP) mixtures
have not seen significant use in supervised problems with grouped data since a straightforward application of the HDP on the grouped
data results in learnt clusters that are not predictive of the responses. The sHDP solves this problem by allowing for clusters to be learnt
jointly from the group structure and from the label assigned to each group.

Index Terms—Bayesian nonparametrics, hierarchical Dirichlet process, latent Dirichlet allocation, topic modelling

A.M. Dai is with Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043. E-mail: adai@[Link].
A.J. Storkey is with the Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, United Kingdom. E-mail: [Link]@[Link].
Manuscript received 16 Sept. 2012; revised 22 Nov. 2013; accepted 17 Feb. 2014. Date of publication 6 Apr. 2014; date of current version 14 Jan. 2015.
Recommended for acceptance by R.P. Adams, E. Fox, E. Sudderth, and Y.W. Teh.
For information on obtaining reprints of this article, please send e-mail to: reprints@[Link], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPAMI.2014.2315802

1 INTRODUCTION

Bayesian nonparametric models allow the number of model parameters that are utilised to grow as more data is observed. In this way the structure of the model can adapt to the data. A Dirichlet process (DP) mixture model [1] is a popular type of nonparametric model that has an infinite number of clusters. DP mixtures are trained in an unsupervised manner and are frequently used for problems that require model adaptation to different data sizes or where more and more new components are likely to be represented in the data as the data size increases.

In this paper, we describe a new nonparametric supervised model for grouped data that utilises topics, where topics are distributions over data items that are shared across groups. We analyse the performance of the model using experiments on both regression and classification tasks. The problems of regression and classification are ubiquitous and related; both involve labelled examples. Each example takes the form of a pair consisting of a predictor, also known as input, covariate or independent variable, and a response, also known as output or dependent variable. The set of examples is then used as data to inform models that predict the responses for test examples where the response is unknown.

Topic models such as latent Dirichlet allocation (LDA) [2] are unsupervised models of grouped data, where the topics are distributions that are shared across the groups. A typical example of such data is the text in the documents of a corpus. In this context, each group is a document, and the topics are distributions over a vocabulary of terms (e.g., words). The topic distributions are shared across a number of the documents. Each topic can be thought of as a group of semantically related words, and inferred topics shed light on the common themes that run through the documents. Topic models of this form are mixed-membership models since each document consists of a mixture of topics in different proportions. Topic models have been successful in analysing collections of documents, including abstracts from citation databases [3] and newsgroup corpora. They can also be used for a wide range of applications including data exploration, authorship modelling [4] and information retrieval. The latent topics that are learnt are particularly important when modelling large document collections as they can reduce the dimension of the data.

Recently, attention has turned to these models as ways of performing regression and classification on collections of documents, where each document possesses an associated response. The response can be categorical, continuous, ordered or of some other type. For example, the response could be a sentiment rating. A simple approach to the problem of modelling document responses is to use topic models as a dimensionality reduction method and then to regress on the resulting lower dimensional data set. A set of topics is learnt for the corpus using a topic model while ignoring the document responses. Then the document responses are regressed on the empirical topic distribution for each document. However, this approach performs poorly in contrast to directly regressing on the empirical word distribution for each document [5]. The topics that are learnt also often have no relation to the responses that need to be predicted. As a result, the words that cause positive responses and those that cause negative responses end up being assigned to the same topic. This difficulty has spurred interest in supervised topic models that can learn topics that are both good models of document contents and are good predictors for document responses.

Supervised topic models (sLDA) [5] are an extension of LDA [2]. The topics that are learnt are more useful for predicting a document's response than those obtained in unsupervised LDA. This is because the learnt topics are oriented
around terms that are predictive of document responses. For example, in sentiment analysis tasks, the topics learnt consist of terms that cause the document to have positive or negative sentiment. Similarly, for financial news, the topics consist of terms that have positive or negative effects in the market. In contrast, unsupervised LDA learns topics that are in line with the general theme of the documents, but are often unrelated to the document responses. Blei and McAuliffe [5] found that the predictions made by sLDA for the responses of an unseen test set were better than the predictions made using the unsupervised topics inferred by LDA. However, they found that the sLDA model only performed slightly better than LASSO regression on the empirical distribution of words for each document.

Although supervised topic models perform well, they are limited as the number of topics in the model must be fixed in advance. This can lead to overfitting in sLDA when there are too many topics and regression parameters in the model so that topics are relatively specific and do not generalise well to unseen observations. Underfitting is the opposite case when there are too few topics and regression parameters in the model so unrelated observations are assigned together to the same topic. Another characteristic is the relative contribution of the supervised and unsupervised components to the model. In a fixed parametric setting it can be that one or the other (typically the unsupervised part) of these signals may dominate the likelihood, which determines the topic features. In a nonparametric setting, any dominant individual signal is captured by a set of components, leaving the remaining joint topic and supervised signal to be captured by as many additional components as are relevant.

A number of methods can be used to choose the number of topics, including cross-validation and model comparison techniques; however, these are slow as the algorithm has to be restarted a number of times and choosing the ideal number of topics from the runs can be difficult. Bayesian nonparametric methods have emerged as a good way to extend these models naturally to handle a flexible number of topics.

The nonparametric supervised HDP (sHDP) model is presented in this paper. The sHDP model is a generative supervised model that has an infinite number of topics (or clusters) that can be used to predict a document response. The sHDP model is a nonparametric extension of the supervised topic model (sLDA) [5]. The main contribution of the model is that it overcomes the issue of choosing the fixed number of topics that is necessary for sLDA. The fact that the model has an infinite number of topics also reduces the problems of underfitting and overfitting. The sHDP can also be considered a supervised extension of the HDP mixture model described in Section 3.1. In this paper, we show that sHDP performs better than sLDA on one data set and comparably to sLDA with the best performing number of topics (chosen post-hoc) on two out of three other data sets (see, e.g., Fig. 2).

The rest of the paper is organised as follows. Section 2 sets the problem and the form of the data for the models proposed in this paper. Then in Section 3, we briefly review some existing work on tackling the supervised learning problem with nonparametric models and also approaches specifically for grouped data, and we go on in Section 3.2 to give an introduction to generalised linear models (GLMs), and review the sLDA model (Section 3.3), both of which are important in the later parts of the paper. We then introduce the supervised HDP model in Section 4. Section 5 describes the inference algorithms that are used to sample from the posterior of the new model. Finally, Section 6 covers experiments with this model on real-world data sets consisting of both binary and continuous responses and compares the new model to existing models.

2 PROBLEM DESCRIPTION

In this section we outline the structure of the problems for which this work is relevant. First, we assume that there is a set of data points divided into $D$ groups. Second, to reduce complexity, we should be willing to assume a bag of words representation can be used for each group, which amounts to assuming exchangeability among the observations within a group. Each group $i$ consists of both a variable number of data points $x_{ij}$, $j = 1, \ldots, N_i$, which are the predictors, and a single response $y_i$. Given a set of training examples with predictors and associated responses, the task is to predict the responses on a separate test set of predictors. In the case of document modelling, $D$ is the number of documents in the corpus, and each word uses one-of-$V$ encoding $x_{ij} \in \{1, \ldots, V\}$ where $V$ is the size of the vocabulary of the corpus. $y_i$ is the response for the document, such as a rating or a category. In the rest of this paper, the problem and models will be described in terms of documents and words, but all the models can also be used on other kinds of grouped data.

3 BACKGROUND

In this section we outline previous work and other methods that will be used in this paper. Many of these approaches utilise Bayesian nonparametric models to gain more flexibility than parametric models.

Due to their flexibility, there has been interest in supervised nonparametric models, such as the regression models of Gaussian processes (GPs) [6] and Bayesian regression trees. Dirichlet processes have also been adapted for supervised problems. An example of this is the Dirichlet process multinomial logit model (dpMNL) [7]. In this generative model, the relationship between the covariates and responses is modelled jointly using Dirichlet process mixtures. Although within each cluster the relationship is assumed to be linear, an overall nonlinear relationship occurs when the model has more than one cluster. A multinomial logit is used to model the responses conditionally on the covariates within each cluster. Thus, the regression parameters of the logit model are different for each cluster. The predicted responses are conditional on the parameters and the covariates. The dpMNL model was tested on protein fold classification, and compared with existing methods based on neural networks and support vector machines. The results showed that the dpMNL model performed significantly better.

The dpMNL has been extended to model additional response types with DP mixtures of generalised linear models (DP-GLM) [8]. Whereas the dpMNL only explicitly
models discrete responses, the DP-GLM can generatively model both continuous and discrete responses using different generalised linear models. Again, the regression coefficients of the generalised linear models are different for each cluster. Priors are also placed on the coefficients, resulting in a regularised model for the response. The model was shown to have weak consistency by Hannah et al. [8], and the performance was shown to be comparable to a Gaussian process model.

Neither the dpMNL nor the DP-GLM has, to our knowledge, been applied to the problem of predicting the responses of groups of observations. The supervised topic model (sLDA) is one approach to tackling this prediction problem for grouped data (e.g., documents). sLDA learns topics that are able to model the document responses more accurately. The sLDA model has, however, limited flexibility since the number of latent topics must be fixed in advance, leaving it at risk of overfitting or underfitting. There has also been work on other methods of learning the regression coefficients or other response types such as DMR [9], MedLDA [10] and labeled LDA [11]; however, these models still have a fixed number of topics.

Hierarchical Dirichlet process (HDP) mixture models, described in Section 3.1, are a type of Bayesian nonparametric model that can be used instead of LDA for topic modelling. They are commonly used as the nonparametric analog to LDA, allowing for flexible topic modelling without being restricted to a fixed number of topics. Though inference is more complex, Gibbs sampling and variational Bayes techniques can still be applied. Until now, HDP mixtures have not seen significant use in supervised problems and suffer the same problems as unsupervised LDA in that the topics learnt are not necessarily predictive of the responses. The sHDP model we present in this paper extends the HDP mixture model to learn topics that are good predictors of document responses.

3.1 Hierarchical DPs

A DP [1], [12] is a stochastic process that can be thought of as a probability distribution on the space of probability measures. The name of the process accurately describes the fact that the DP results in finite-dimensional Dirichlet marginal distributions, similar to the Gaussian process that has Gaussian distributed finite-dimensional marginal distributions. DPs are commonly used as a prior on the space of probability measures, which gives wider support and so improved flexibility over using traditional parametric families as priors. In addition, DPs also have tractable posteriors, making them important in Bayesian nonparametric problems. A DP is defined in terms of a base measure and a concentration parameter. Each draw from the DP is itself a measure. Since there is a positive probability of drawing a previously drawn value, the draws are discrete with probability 1. This makes them very useful for clustering in DP mixtures.

The HDP [13] is a hierarchical extension to DPs. The hierarchical structure provides an elegant way of sharing parameters. This process defines a set of probability measures $G_i$ for $D$ pre-specified groups of data and a global probability measure $G_0$. The global measure is distributed as

$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \tag{1}$$

where $H$ is the base probability measure and $\gamma$ is the concentration parameter.

The random measures for each group $i$ are conditionally independent given the global measure

$$G_i \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0), \tag{2}$$

where $\alpha_0$ is a concentration parameter. The distribution $G_0$ varies around $H$ by an amount controlled by $\gamma$ and the distribution $G_i$ in group $i$ varies around $G_0$ by an amount controlled by $\alpha_0$. This can be seen as adding another level of smoothing on top of DP mixture models. Let $\theta_{i1}, \theta_{i2}, \ldots$ be i.i.d. variables distributed according to $G_i$, where each $\theta_{ij}$ is the parameter that corresponds to an observation $x_{ij}$. The likelihood of these observations is then

$$\theta_{ij} \mid G_i \sim G_i, \tag{3}$$

$$x_{ij} \mid \theta_{ij} \sim F(\theta_{ij}), \tag{4}$$

where $F(\theta_{ij})$ is the distribution of $x_{ij}$ given $\theta_{ij}$. This prior results in a DP being associated with each group in the model where the DPs are conditionally independent given their parent and the parameters drawn in the parent node are shared among the descendant groups. This structure can be extended to multiple levels.

The HDP requires that the data be in a pre-defined nested structure. The HDP model has been used in information retrieval tasks and used in relation with traditional TF-IDF measures [14] for measuring the score of documents in relation to a query. There are variants of HDP that model topics for documents where there is no predefined hierarchical structure (see, e.g., [15]).

3.1.1 Similarity to LDA

With the appropriate base measure, the HDP can be thought of as the infinite analogue of LDA. In the HDP, the base probability measure allows for a countably infinite number of multinomial draws and so an infinite number of topics. This allows the number of topics to grow or shrink according to the data. This solves the problem of finding the best number of topics in LDA and reduces the problems of overfitting or underfitting due to a fixed number of topics.

3.2 Generalised Linear Models

Often when a response is not an unconstrained continuous variable, it is transformed into one and a normal linear model is used for it. However, this may not always be appropriate. A generalised linear model [16] expands the flexibility of linear regression by being capable of analysing data where either there may not be a linear relation between the covariates $x$ and the response $y$ or where a Gaussian assumption for $y$ is inappropriate. Given parameters $\eta$ and covariates $x$, a generalised linear model is specified by a linear predictor which we denote in this section by $\rho = \eta^\top x$, a link function $g(\cdot)$ that relates the linear predictor to the mean $\mu$ of the response, $\mu = g^{-1}(\rho)$, and a probability distribution from the exponential family that gives the distribution of the response $y$ with mean $E(y \mid \cdot) = \mu$. In this paper, we only consider canonical link functions though others can
be used when needed. The canonical link function is a choice of link function such that $\rho$ is the natural parameter in the exponential family distribution. The distribution of the response may also be an exponential dispersion family that has an additional dispersion parameter denoted as $\delta$. We denote this as $\mathrm{ExpFam}(\mu, \delta)$. The generalised linear model for response $y$ takes the form

$$p(y \mid \rho, \delta) = h(y, \delta) \exp\left\{ \frac{\rho y - A(\rho)}{\delta} \right\}, \tag{5}$$

where $A(\rho)$ is the log-normaliser.

Different forms of responses can be modelled using different choices of $h$ and $A$. In particular, there is a Gaussian distribution on $y$,

$$p(y \mid \rho, \delta) = \frac{1}{\sqrt{2\pi\delta}} \exp\left\{ -\frac{1}{2\delta} (y - \rho)^2 \right\}, \tag{6}$$

when $h(y, \delta) = (1/\sqrt{2\pi\delta})\, e^{-y^2/(2\delta)}$ and $A(\rho) = \rho^2/2$. This is a normal linear model with a mean of $\rho$ and variance of $\delta$.

When $y$ is binary, a binomial distribution can be used with the number of trials $n = 1$, so that $y$ is distributed as

$$p(y \mid \rho) = \mu^{y} (1 - \mu)^{1 - y}, \tag{7}$$

where $\mu = g^{-1}(\rho)$ is the mean of the response. This uses the canonical logit link function $g(\mu) = \ln(\mu / (1 - \mu))$ and the binomial distribution for $y$. This choice of distribution and link function results in a logistic regression model.

3.3 The Supervised Topic Model

The supervised topic model (sLDA) [5] is an extension of LDA to supervised problems. It partially overcomes the problem that the topics that are learnt cannot be controlled in the LDA model. The learnt topics in LDA act to reduce the dimension of the data but may not be predictive of a document's response as they will correspond to the general themes of the corpus. sLDA overcomes this problem by jointly learning topics and their regression coefficients for the document responses. The response for a document is predicted by averaging over the empirical topic allocations for a document.

The generative process for each document $i$ is the following. Let $K$ be the fixed number of topics, $N_i$ the number of words in document $i$, $\phi_{1:K}$ the topics where each $\phi$ is a distribution over the vocabulary, $\alpha$ a parameter for topic proportions, and $\eta$ and $\delta$ the response parameters:

1) Draw topic proportions $\vartheta_i \sim \mathrm{Dirichlet}(\alpha)$.
2) For each word (enumerated by $j$)
a) Draw a topic assignment $z_{ij} \sim \mathrm{Multinomial}(\vartheta_i)$.
b) Draw a word $w_{ij} \mid z_{ij} \sim \mathrm{Multinomial}(\phi_{z_{ij}})$.
3) Draw the document response $y_i \mid z_{i,1:N_i}, \eta, \delta \sim \mathrm{ExpFam}(g^{-1}(\eta^\top \bar{z}_i), \delta)$ where $\bar{z}_i = (1/N_i) \sum_{j=1}^{N_i} z_{ij}$.

This implements a GLM for the document responses: ExpFam is a distribution from the exponential family, $g$ is the link function and $\delta$ is the dispersion parameter for the distribution. The linear predictor in the GLM model for the response is $\eta^\top \bar{z}$ where $\bar{z}$ are the empirical frequencies of the topics in the document and $\eta$ are the regression coefficients. Exchangeability of the topic assignments implies posterior parameter symmetries in the GLM model, were a full Bayesian solution obtained. However, if we are constrained to a MAP inferential setting there is no possibility of parameter symmetry and so there is symmetry breaking of the exchangeability of topic assignments. The inference process used must be sensitive to this broken symmetry. An inference process that considers the generation process for the document contents first enables consistent topic labels to be determined. Then the document's response is chosen conditional on those contents, and hence on topic labels that have a consistent meaning. An alternative to this is to choose a model where $y$ is regressed on the topic proportions for the document, $\vartheta$. However, this may result in some topics being estimated that just explain the response variables while other topics only explain the document words.

In the sLDA model, the parameters $\alpha$, $\phi_{1:K}$, $\eta$ and $\delta$ are treated as constants to be estimated. Approximate maximum-likelihood estimation is then performed with a variational expectation-maximisation (EM) method, similar to that for LDA. Collapsed Gibbs sampling can also be used for inferring the topics jointly as in LDA.

The models we propose in this paper solve the issue sLDA has of requiring the number of topics to be fixed from the start. This can result in overfitting or underfitting if the number of topics is unsuitable for the data set. Though the number of topics can be chosen based on a training set, the process can be difficult and time consuming.

4 THE SUPERVISED HDP MODEL

The supervised HDP model proposed in this paper can automatically learn the necessary number of topics to model the responses of documents on training data. It is a Bayesian nonparametric model so that a potentially infinite number of latent clusters can be used for prediction. The sHDP model extends the HDP mixture model to learn clusters that align with document responses. The relationship between the data points and the responses is modelled with a generalised linear model on the clusters to which the data points in a document have been allocated. A regression coefficient is associated with each cluster, and the document's response is regressed on the mean of these coefficients.

In the sHDP model, unlike sLDA, the number of topics does not need to be fixed in advance. This is beneficial in supervised problems since it is unclear how many latent topics will be necessary to model the data and the response conditional on the document. The response is modelled by a generalised linear model conditioned on the topics that have been assigned to the observations in the document. Since the number of instantiated topics can vary and each topic has a regression coefficient, the number of instantiated regression coefficients also varies given the current number of instantiated topics. In the generative process, a regression coefficient is sampled for each topic in addition to sampling a distribution over the vocabulary. In effect, a product base measure is used for the topics where one component is a prior over the vocabulary and the other is a prior for the regression coefficient. This treats the regression coefficients
as random variables, whereas in sLDA, the regression coefficients are treated as constants. This modelling of the regression coefficients results in a regularised regression model for the response variables. Each topic can also be assigned a vector of regression coefficients for categorical responses.
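The idea of a product base measure can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a symmetric Dirichlet prior over the vocabulary (parameter `alpha_w`) and a zero-mean Gaussian prior with variance `zeta` on a scalar coefficient. The function names, the toy sizes and the topic assignments `z` are hypothetical and chosen only for illustration.

```python
import numpy as np

def draw_topic(rng, vocab_size, alpha_w=0.5, zeta=1.0):
    """Draw one topic from an assumed product base measure.

    Returns a distribution over the vocabulary (symmetric Dirichlet draw)
    and a scalar regression coefficient (zero-mean Gaussian with variance zeta).
    """
    word_dist = rng.dirichlet(np.full(vocab_size, alpha_w))
    coef = rng.normal(loc=0.0, scale=np.sqrt(zeta))
    return word_dist, coef

def linear_predictor(assignments, coefs):
    """Mean of the coefficients of the topics assigned to a document's words."""
    return float(np.mean(coefs[np.asarray(assignments)]))

rng = np.random.default_rng(0)
topics = [draw_topic(rng, vocab_size=5) for _ in range(3)]
coefs = np.array([coef for _, coef in topics])

# Hypothetical topic index for each of the four words in one document.
z = [0, 0, 2, 1]
rho = linear_predictor(z, coefs)
print(rho)
```

Because each coefficient is drawn from a prior rather than fixed, the coefficients are shrunk towards zero during inference, which is the regularisation effect described above.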
The model is thus

$$G_0 \sim \mathrm{DP}(\gamma, H) \tag{8}$$

$$G_i \sim \mathrm{DP}(\alpha, G_0) \tag{9}$$

$$\theta_{ij} = (\theta^X_{ij}, \theta^Y_{ij}) \sim G_i \tag{10}$$

$$x_{ij} \mid \theta^X_{ij} \sim f(\theta^X_{ij}) \tag{11}$$

$$y_i \mid \bar{\theta}^Y_i \sim \mathrm{ExpFam}(g^{-1}(\bar{\theta}^Y_i), \delta), \tag{12}$$

where $\bar{\theta}^Y_i = (1/N_i) \sum_j \theta^Y_{ij}$ is the linear predictor for the GLM, $g$ is its link function and $\delta$ is the dispersion parameter for the exponential family distribution. $i$ ranges over each document, $j$ ranges over each observation in that document, $\gamma$ denotes the concentration parameter for the corpus-level DP and $\alpha$ denotes the concentration parameter for the document-level DP. The base measure $H = H^Y \times H^X$ consists of a measure for the regression parameters $\theta^Y \sim H^Y$ and another measure for the topic parameters $\theta^X \sim H^X$. $G_0$ is the corpus-level random measure that acts as the base measure for the document-level random measure $G_i$.

Due to the clustering property of the DP, some data points will share the same parameters $\theta$, which can be represented as those data points being assigned to the same topic. The prior density for the regression parameters is typically $H^Y = \mathcal{N}(0, \zeta)$. For topic modelling, the documents consist of words, and the prior density for the cluster parameters is $H^X = \mathrm{Dirichlet}(\alpha_w)$, where $\alpha_w$ is the parameter for a symmetric Dirichlet distribution. $f$ is the likelihood of $\theta^X$ given the observations $x$. In a topic modelling problem, $f(\theta^X) = \mathrm{Multinomial}(\cdot \mid \theta^X)$. When coupled with its conjugate prior, the Dirichlet distribution, the topic parameters $\theta^X$ can be integrated out, allowing for collapsed Gibbs inference to be performed by just keeping track of the word to topic allocations and the regression coefficients for the topics. The GLM model for the responses allows the responses to be continuous, ordinal, categorical and other types depending on the form of the GLM. If the base measure for the coefficients $H^Y$ is chosen to be Gaussian, the maximum a posteriori (MAP) solution for the coefficients is similar to the solution for L2 penalised regression. A graphical model is shown in Fig. 1.

Fig. 1. The supervised HDP model where the observed variables are the words $w_{ij}$ denoting word $j$ in document $i$ and the document label $y_i$.

The generative process for the full model is:

1) Draw (from their prior distributions) the concentration parameters, $\gamma$, for the global DPs. Likewise, draw the concentration parameters, $\alpha$, for the lower-level DPs from their priors.
2) Draw a global distribution over topics and their regression coefficients $T_0 \sim \mathrm{DP}(\gamma, H)$.
3) Now for each document $i$,
a) Draw a distribution over topics $T_i \sim \mathrm{DP}(\alpha, T_0)$.
b) For each word $w_{ij}$,
i) Draw a topic $(\theta^X_{ij}, \theta^Y_{ij}) \sim T_i$.
ii) Draw a word $w_{ij} \sim \mathrm{Multinomial}(\theta^X_{ij})$.
c) Draw a response for the document $y_i \sim \mathrm{ExpFam}(g^{-1}(\bar{\theta}^Y_i), \delta)$ where $\bar{\theta}^Y_i = (1/N_i) \sum_j \theta^Y_{ij}$.

The sHDP learns topics that both model document contents well and are predictive of document responses without the need to choose a fixed number of topics beforehand. This structured approach to supervision allows the model to be easily extended to incorporate additional information from documents to aid in predicting the response such as the authors of a document or the research group which authored a document, which can be inferred through the grouped author-topic model [17]. For example, there is the problem of predicting the venue where a paper is published by learning the venues where the research group has previously published. Another example could be the problem of predicting a set of keywords or categories for a paper by learning which categories have previously been picked by the research group for those topics. Allowing for the topics to be supervised can also give more control over the types of topics that are learnt by the sHDP in case the unsupervised topics are not interesting for a particular task. Finally, the sHDP model allows for unlabelled data to be used as part of the training set in semi-supervised problems. This allows supervised topics to be learnt that take into account the content of unlabelled documents so that the topics can better model the entire corpus instead of just the labelled documents.

5 INFERENCE

Since posterior inference is intractable in DP-based models, approximations must be used. Collapsed Gibbs sampling is
the most common technique used to sample from the posteriors of these models, and it can also be applied to the model described in this paper. For topic modelling problems, the Dirichlet base measure for each topic, which is a distribution over the vocabulary, is conjugate to the multinomial likelihood for the words. This enables the topic parameters to be integrated out. Thus at each iteration and based on the Chinese restaurant process, collapsed Gibbs sampling can be used to sample the topic allocations. The regression coefficients can then be sampled from their posteriors in some cases such as for a Gaussian response and can be approximated in other cases. The following sections describe inference in the proposed sHDP model.

5.1 The sHDP Model
Since the base measure for the topic regression coefficients will not in general be conjugate to the response model, the non-conjugate auxiliary variable sampling algorithm (Algorithm 8) described by Neal [18] is used to sample the topic allocations, $z_{ij}$, where $z_{ij} = k$ indicates that word $w_{ij}$ is allocated to topic $k$. The main difference from inference for the HDP mixture model is in sampling the topic allocation variable and the topic regression coefficients. The conditional distribution for the topic allocation has an additional term for the conditional likelihood of the topic parameters given the document response. Gibbs sampling proceeds as below:

1) For each document $i$,
   a) Let $n_{ik}$ denote the number of words in document $i$ allocated to topic $k$, and let a superscript $-ij$ for a variable denote the terms excluding the $ij$th term. For each word $w_{ij}$, sample the topic allocation $z_{ij}$ using

$$
p(z_{ij} = k \mid z^{-ij}, w_{ij}, \beta) \propto
\begin{cases}
(n_{ik}^{-ij} + \alpha\beta_k)\, f_k^{-w_{ij}}(w_{ij})\, p(y_i \mid z^{-ij}, z_{ij} = k, \boldsymbol{\eta}), & \text{if } k = z_{i'j'} \text{ for some } (i', j') \neq (i, j), \\
\alpha\beta_{\text{new}}\, f_{\text{new}}(w_{ij})\, p(y_i \mid z^{-ij}, z_{ij} = k^{\text{new}}, \boldsymbol{\eta}^{\text{new}}), & \text{if } k = k^{\text{new}},
\end{cases}
\tag{13}
$$

      where $\boldsymbol{\eta}^{\text{new}} = (\boldsymbol{\eta}, \eta_k^{\text{new}})$, $\eta_k^{\text{new}} \sim N(0, \zeta)$, $f_k$ is the distribution of the word given the other words allocated to topic $k$ and $f_{\text{new}}$ is the probability of the word in an empty topic.
      If a new topic $k^{\text{new}}$ is sampled during one of the steps above, then draw $b \sim \mathrm{Beta}(1, \gamma)$, set the new weight $\beta_{k^{\text{new}}} = b\,\beta_{\text{new}}$ and set the new $\beta_{\text{new}}$ to $(1 - b)\,\beta_{\text{new}}$. The value $b$ corresponds to the weight of the new atom that is instantiated from the Dirichlet process. Also, set $\boldsymbol{\eta}$ to the value of $\boldsymbol{\eta}^{\text{new}}$.
   b) Sample $m_{ik}$, where $k$ ranges over the topics, by generating $n_{ik}$ uniformly distributed random variables $u_1, \ldots, u_{n_{ik}}$ between 0 and 1 and setting

$$
m_{ik} = \sum_{m=1}^{n_{ik}} \mathbb{1}\!\left[u_m \le \frac{\tau\beta_k}{\tau\beta_k + m}\right],
\tag{14}
$$

      where $\mathbb{1}[\cdot]$ is the indicator function.
2) Sample $\beta$ from

$$
(\beta_1, \ldots, \beta_K, \beta_{\text{new}}) \sim \mathrm{Dirichlet}(m_1, \ldots, m_K, \gamma).
\tag{15}
$$

For a continuous response assuming $g = 1$,

$$
p(y_d \mid z, \boldsymbol{\eta}) \propto \exp\!\left(-(y_d - \boldsymbol{\eta}^\top z)^2\right),
\tag{16}
$$

and for a binomial response where $y_d \in \{0, 1\}$,

$$
p(y_d \mid z, \boldsymbol{\eta}) = (\boldsymbol{\eta}^\top z)^{y_d} (1 - \boldsymbol{\eta}^\top z)^{1 - y_d}.
\tag{17}
$$

During prediction, the posterior of $z$ is needed over the test documents. This is calculated by removing the terms that depend on the response $y$ from the conditional distributions so that inference on the test documents is identical to unsupervised sHDP. The posterior for the test samples can be sampled by replacing (13) with

$$
p(z_{ij} = k \mid z^{-ij}, w_{ij}, \beta) \propto
\begin{cases}
(n_{ik}^{-ij} + \alpha\beta_k)\, f_k^{-w_{ij}}(w_{ij}), & \text{if } k = z_{i'j'} \text{ for some } (i', j') \neq (i, j), \\
\alpha\beta_{\text{new}}\, f_{\text{new}}(w_{ij}), & \text{if } k = k^{\text{new}},
\end{cases}
\tag{18}
$$

and sampling the allocations and counts for the test documents.

5.2 Parameter Posteriors and Prediction
The topic regression coefficients are sampled after each round of sampling the topic assignments. We also performed experiments where the topic assignments were sampled for several rounds in between sampling the regression coefficients but this made little difference to prediction performance. The topic coefficients can be updated for sHDP by regressing only on the topics that are allocated to at least one observation. We will describe cases for a Gaussian and binary response in this section, though other models for the response can be used too.

5.2.1 Gaussian Model
In the Gaussian model, we place a Gaussian prior on the regression coefficients. The model response can be rewritten as

$$
y = X\boldsymbol{\eta} + c,
\tag{19}
$$

where $y$ is a length-$D$ vector of document responses, $X$ is a $D \times \infty$ matrix of cluster to document allocation counts, $\boldsymbol{\eta}$ is a vector of regression parameters for each topic and $c$ are the residuals. Let $X$ be the matrix where row $d$ is the empirical topic distribution for document $d$. Since only a finite number of topics have non-zero counts in the corpus, the columns in $X$ that have zero counts and their corresponding $\boldsymbol{\eta}$ entries can be ignored, so making posterior computation tractable.
The posterior distribution for the parameters $\boldsymbol{\eta}$ is then a Gaussian distribution:

$$
\boldsymbol{\eta} \sim N\!\left((X^\top X + \zeta I)^{-1} X^\top y,\; (X^\top X + \zeta I)^{-1}\right),
\tag{20}
$$

where $\zeta$ is the prior variance for the regression coefficients and $I$ denotes the identity matrix.
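As an illustration of the Gaussian update in (20), the following is a minimal NumPy sketch and not the authors' implementation. It assumes X holds one row per document and one column per instantiated topic, drops columns with no allocated words as described above, and draws the sample through a Cholesky factor of $X^\top X + \zeta I$, following (20) as written. The function name `sample_eta_gaussian` and its arguments are hypothetical.

```python
import numpy as np


def sample_eta_gaussian(X, y, zeta, rng):
    """Draw eta ~ N(A^{-1} X^T y, A^{-1}) with A = X^T X + zeta * I, as in (20).

    X    : (D, K) array, one row per document, one column per topic (assumed layout)
    y    : (D,) array of document responses
    zeta : scalar added on the diagonal, as in (20)
    rng  : numpy.random.Generator
    Returns the full-length eta (zeros for unused topics) and the active-topic mask.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Topics with no allocated words carry no information and are skipped.
    active = X.sum(axis=0) > 0
    Xa = X[:, active]

    A = Xa.T @ Xa + zeta * np.eye(Xa.shape[1])
    mean = np.linalg.solve(A, Xa.T @ y)

    # With A = L L^T, the vector L^{-T} eps has covariance A^{-1}.
    L = np.linalg.cholesky(A)
    eps = rng.standard_normal(Xa.shape[1])
    eta_active = mean + np.linalg.solve(L.T, eps)

    eta = np.zeros(X.shape[1])
    eta[active] = eta_active
    return eta, active
```

A call such as `sample_eta_gaussian(X, y, zeta=1.0, rng=np.random.default_rng(0))` returns one posterior draw. Solving against the Cholesky factor avoids forming the inverse explicitly, which is a design choice of this sketch rather than something the text prescribes.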
For prediction, topics are sampled for test documents as in (18). The empirical topic distribution is sampled over a number of iterations with any topics that are instantiated or any topics that are removed during this period ignored. The remaining empirical topic distributions for each document are averaged and used to calculate the expectation of the response.
For the sHDP model, this is calculated as

$$
E[y \mid z, \boldsymbol{\eta}] \approx \boldsymbol{\eta}^\top E[z].
\tag{21}
$$

5.2.2 Binomial Model
For the logistic regression GLM model, the likelihood is

$$
l(\boldsymbol{\eta}) \propto -\sum_{d=1}^{D} \log\!\left(1 + \exp(-y_d \boldsymbol{\eta}^\top z_d)\right) - \frac{\zeta}{2} \boldsymbol{\eta}^\top \boldsymbol{\eta}.
\tag{22}
$$

The gradient is then

$$
\nabla_{\boldsymbol{\eta}}\, l(\boldsymbol{\eta}) = \sum_{d} \left(1 - \sigma(y_d \boldsymbol{\eta}^\top z_d)\right) y_d z_d - \zeta \boldsymbol{\eta},
\tag{23}
$$

where $\sigma(\cdot)$ is the logistic sigmoid function,

$$
\sigma(x) = \frac{1}{1 + \exp(-x)}.
\tag{24}
$$

We place a Gaussian prior distribution on the regression coefficients; however, since there is no conjugate prior, the posterior distribution is not available in closed form. To sample from the exact posterior for the coefficients, the Gibbs sampling method presented by Groenewald and Mokgatlhe [19] and used in topic models by Mimno et al. [20] can be used. However, we found that this method took numerous iterations to converge for a given topic assignment due to the high number of coefficients. As a result, in our experiments we instead sample from an approximation to the posterior. A common approximation to use is the Laplace approximation, which involves sampling from a Gaussian centred at the MAP estimate of the parameters with a covariance matrix that is the inverse of the negative Hessian of the unnormalized log posterior at that estimate. The limited-memory BFGS algorithm can be used to find the MAP estimate of the parameters [21].
For prediction, topics are sampled for test documents as in (18).
For the sHDP model, the distribution of the response is given by

$$
p(y_d = 1 \mid z, \boldsymbol{\eta}) \approx \frac{\exp(\boldsymbol{\eta}^\top E[z])}{1 + \exp(\boldsymbol{\eta}^\top E[z])}.
\tag{25}
$$

For simplicity, we also consider using the MAP estimate of the parameters directly. In many cases we find there is not a significant performance benefit of using parameter sampling over using the MAP solution directly.
A sampling step of the supervised HDP algorithm that samples the regression coefficients (the sampled model) is shown below in pseudocode. To initialize, words are randomly allocated to topics so $z$ is set randomly, $K$ is set to the maximum value of $z$, $m$ and $\beta$ are sampled from (14) and (15) respectively.

6 EXPERIMENTS
We conducted experiments on four real-world data sets. First, we considered the classification problem of determining the effect of financial newswires on the direction of change of the closing prices of a set of stocks. Second, we focused on the classification problem of determining whether movie review sentences are positive or negative. Third, we addressed the regression problem of predicting a rating for a full movie review and fourth, the regression problem of predicting the popularity of a document. The data sets were preprocessed to keep the terms with the highest total TF-IDF score. TF-IDF is a measure of how important a term is for a document in a corpus. The score is calculated as $\mathrm{tf}(w) \log(D / n_w)$ where $\mathrm{tf}$ is the frequency of the term $w$ in the document, $D$ is the number of documents and $n_w$ is the number of documents where the term $w$ occurs. This is summed across all the documents for each term, and the highest scoring terms are kept.
The newswire classification data set consists of a set of real-world newswires extracted from Reuters about the stocks in the S&P 500 on different days over a year up to May 2011. The newswires were labelled with the companies that were mentioned in the wire. These labels were used so that only newswires for stocks that, on that day, had more than an 8 percent positive change or a 3 percent negative change from the previous day were considered. These cutoffs were chosen so that the number of declining stocks was similar to the number of rising ones, and to ignore minor changes of prices due to other factors. This resulted in a data set of 1,518 documents and a vocabulary of 1,895.
The review snippet classification data set [22] consists of reviews from the Rotten Tomatoes website with reviews that were marked as fresh labelled as positive reviews and
reviews that were marked as rotten labelled as negative reviews. The data set contains 5,331 positive snippets with the same number of negative ones and a vocabulary of 4,310.
The review snippet regression data set [22] consists of reviews written by four film critics where the writer additionally assigned a rating to his or her review. The ratings were normalised to be between 0 and 1. Any terms that appeared in more than 25 percent of the documents were removed as were any terms that appeared fewer than five times. Only the remaining top 2,179 terms by TF-IDF score were then kept. The ratings for each document were preprocessed to normalise the scores by applying a log transform. There was a total of 5,005 documents.
The document popularity regression data set is a data set of submission descriptions from the Digg website with the associated number of votes that each submission received. The number of votes was again normalised by applying a log transform. The data set consisted of a vocabulary of 4,120 across 3,880 documents.
Experiments were performed with the sHDP model and the sLDA model. Both models were implemented using MCMC methods (collapsed Gibbs sampling in the case of sLDA with the Chang [23] implementation) and predictions were done using an equivalent sample in both instances. For sLDA we also applied a variational approach with the Wang [24] implementation, and the results for sLDA are given for both collapsed Gibbs and variational inference approaches. For sLDA, for collapsed Gibbs sampling, 3,000 iterations were used and for variational inference, EM was run until the relative change in the likelihood bound was less than 0.01 percent.
The accuracy for classification problems and predictive $R^2$ for regression problems after five-fold cross-validation were calculated. Predictive $R^2$ is defined as

$$
pR^2 = 1 - \frac{\sum_d (\hat{y}_d - y_d)^2}{\sum_d (y_d - \bar{y})^2},
\tag{26}
$$

where $y_d$ are the observed responses, with $d$ ranging over the documents, $\hat{y}_d$ is the response predicted by the model and $\bar{y} = \frac{1}{D} \sum_{d=1}^{D} y_d$ is the mean of the observed responses. This value gives the proportion of variability in the data set that is accounted for by the model and is often used to evaluate the goodness of fit of a model. A value of 1.0 is obtained when the regression line perfectly fits the data. We present accuracy and predictive $R^2$ results that are calculated on the full set of predictions. We also give indicators of the minimum and maximum difference in performance (across the folds) of each method, relative to the sampled sHDP.
In the experiments, the prior standard deviation of the parameters $\zeta$ was tested with three values (1, 5 and 10) on each fold's training set by splitting the fold's training set into a smaller training and validation set and choosing the best value on the validation set. This was also done when choosing the prior standard deviation for sLDA. $\alpha_w$ for the sHDP model was set to 0.01 on data sets similar to previous experiments with HDP. In the sHDP, the standard prior Gamma(1, 1) was placed on $\alpha$ and $\gamma$ and these are sampled during inference. For sHDP, learning took place over 2,000 iterations with the coefficients being sampled every iteration. For predicting the responses of the test documents, 500 iterations of topic sampling were used to allow the inferred topics to converge. The number of iterations was chosen by looking at the trace plots of the residuals and the regression coefficients, which appeared to converge by that number of iterations. To compare our models, we carried out experiments using sLDA with variable numbers of topics so that performance with sLDA with the best performing number of topics on the test set can also be compared.
We show results for a sHDP inference algorithm that uses the MAP estimate of the regression coefficients (which reduces computation time) during inference and uses a fixed set of coefficients at test time. We also show results for an algorithm that samples from the posterior of those coefficients during training and test time. For the Gaussian model, we sample from the posterior as in Eq. (20). Using a Gibbs sampling method to sample the coefficients of the binomial model took many iterations to converge so we sampled the coefficients from a Laplace approximation. Finally, we also do experiments with a two-step algorithm in which unsupervised topics for the documents are first learnt as in a HDP model and then a GLM model is trained on top of the learnt topics to predict document labels. In this way, the performance of jointly training the topics and the GLM model can be compared with training the two in separate steps.

6.1 Results
Fig. 2 shows that the supervised HDP model performs significantly better than the sLDA model on the newswire data set. For almost all models, sLDA inference using Gibbs sampling performs better than with variational EM, so we have not shown the variational EM to avoid clutter. sHDP performs competitively against sLDA with the best performing number of topics (as chosen on the test set) on the remaining data sets except for the movie snippet data set. In the movie snippet data set, sLDA outperformed the sHDP across the number of topics. From the results it can be seen that for sLDA, picking the right number of topics is key to getting good performance. Moreover, picking too few topics or too many in some cases can cause big drops in performance. On the other hand, for the sHDP, the model yields good performance without having to pick the number of topics. For sHDP, the results also show that sampling the regression coefficients from their posteriors makes little difference to the results compared to using the MAP value of the coefficients. Additionally, the simple alternative of learning the topics unsupervised in a HDP model and then training a GLM model on top (a two-step supervised approach) performs significantly worse than jointly learning the topics with the sHDP model.
The better performance of sHDP compared to sLDA for the newswire data set and competitive performance with the other data sets is partly due to the increased flexibility of the model and better mixing during inference as can be seen in Fig. 3. The increased flexibility comes from the model having an infinite number of topics to model the documents and responses. The better mixing results from the fact that during inference, clusters can be instantiated or
unneeded ones can be removed while sampling. Since newly instantiated clusters are empty, it is easier for words to change topic and be allocated to a new cluster. In contrast, in sLDA each topic almost always has a significant number of words allocated to it, making it difficult for the distribution of a topic to change. This has the effect of smoothing over term contributions for each topic. Thus, the fact that there are more specific topics in the sHDP model helps to improve performance.

Fig. 2. Results for the test data sets after five-fold cross-validation. Classification results are given for (a) the newswire data set and (b) the movie snippet data set. Regression results ($R^2$) for the entire data set are given for (c) the movie reviews data set and (d) the document popularity data set. sLDA Gibbs performance is shown for each number of topics. Variational EM performed as well as or worse than Gibbs sampling and is omitted for space. For sHDP, the performance with MAP parameters, with parameters sampled from their posteriors and with a two-step supervised approach where an unsupervised HDP model is learnt and a GLM model trained on top of that is shown. The upper and lower bars show the minimum and maximum performance of each method relative to the performance of the sampled sHDP (minimum and maximum taken over the five folds). This allows the reader to see whether a method performs better or worse than sampled sHDP across all the folds.

From the relatively low accuracy scores and large standard deviations, it can be seen that the labels on the newswire data set are much harder to predict than on the movie review data set. The standard deviations for the newswire scores imply that the data is much more noisy since newswires only indirectly influence stock movements. In addition, only closing stock prices are used, which means that it is possible there were changes in the stock price from the general movement of the industry or the market. However, the sHDP is able to pick out these subtle signals whereas sLDA with both types of inference algorithms was unable to.
Fig. 3 is a Gelman-Rubin-Brooks plot [25] which shows how Gelman and Rubin's shrink factor [26] changes as the number of iterations changes during inference. The plot of the shrink factor is calculated from four parallel MCMC chains from different starting points. We present results for the Gaussian sHDP model and the Gibbs sampled sLDA model with 40 topics. The shrink factor is calculated by comparing the within-chain and between-chain variances for each variable of interest. The factor predicts that the
chains have converged if the output from the chains is indistinguishable, which is given by the factor approaching 1. In the plots, the shrink factors for the L2 norm of the regression coefficients and the L2 norm of the residuals for the two models are shown. As can be seen from the plots, the shrink factor for the sLDA model is significantly higher than that of the sHDP model, indicating the sHDP model exhibits better mixing.

Fig. 3. Chain convergence for (TOP) the sHDP model and (BOTTOM) the Gibbs sampled sLDA model with 40 topics. Gelman-Rubin-Brooks plots show how Gelman and Rubin's shrink factor for the L2 norm of the regression coefficients and the L2 norm of the residuals changes across iterations during inference for the movie regression data set. This is shown for four parallel MCMC chains with different starting values. Values close to 1 indicate convergence. From these graphs it can be seen that the Gibbs sampled sLDA model is slower to mix compared to the sHDP model.

We also conducted experiments by regressing directly on the empirical word distribution for each document with L1 regularized generalized linear models and the GLMNET R package. The regularization parameter was chosen through cross validation on the training set of each fold. The accuracies are 61 percent for the newswire data set and 75 percent for the movie snippet data set, and the $R^2$ values are 0.44 for the movie rating data set and 0.064 for the document popularity data set. Hence this approach marginally outperformed both sLDA and sHDP on the newswire, movie snippet and movie review data sets, but it is outperformed by the sHDP on the document popularity data set. Given the large number of available parameters, the model flexibility provides benefits for the L1 GLM, but makes it particularly sensitive to particular word usage. It is therefore understandable that this model does best on the three data sets with more coherent word usage patterns, but is less powerful than a sampled sHDP on document popularity, where the word usage within a topic is more varied. The extra topic structure also enables an understanding of the topic-related dependencies which we expand on in the next section.

6.2 Analysis of Strong Topics and Terms
For the sHDP model, the top positive and negative topics, in terms of their regression coefficients and their most frequent terms for the movie review problem, are shown in Table 1. The topics do not generally correspond to themes such as film genre or style. Instead of this, the topics contain many names such as actors and directors. This is because the flexibility of a nonparametric model means that the top positive and negative topics consist of very few terms and are allocated to actors and directors that are consistently reviewed well or poorly. This flexibility results in strong topics that are grouped around consistently performing actors or directors but the topics are less coherent since they are associated with so few documents. Topics that consist of more terms, even if those are strong terms, generally have smaller regression coefficients since the effect of the different terms is averaged over other words
in the same topic. Strong terms are spread among the top positive and negative topics, for example, positive topic 5 contains the positive term charming and negative topic 2 contains many negative terms such as unfortunately, worse and problem. Since many of the topics have actor and director names such as Tom Hanks in positive topic 2 and the Coen brothers: Ethan and Joel in positive topic 4, it can be seen that specific actors and directors are associated with consistently better or poorer movie review scores.

TABLE 1
The Most Positive and Negative Learnt Topics, in Terms of Their Regression Coefficients, from the Movie Review Regression Data Set with sHDP

The terms in the top topics for sHDP seem to correspond to people's names, such as Cameron and Miller, with only one topic focusing on terms that intuitively should have a strong contribution to a movie rating. This shows that the topics being learnt are divided into those that correspond to the content of the corpus and those that are more focused on general terms that affect the rating of a movie. The most positive topic and negative topics have no association with film genres and are more concentrated on specific actors, which are more likely to perform consistently.
The top positive and negative topics for the two-step algorithm in which a GLM model is trained on top of unsupervised topics learnt with a HDP model are shown in Table 2. These topics are different in that there are no topics like negative topic 2 from the sHDP model and in general there are far fewer sentiment-related terms in the strong topics. The fact that the regression coefficients are smaller in magnitude also indicates that sentiment or rating-predictive terms are more spread out among the topics, meaning that individual topics are less predictive of the rating.

TABLE 2
Strongest Topics for the Movie Review Regression Data Set Using the Alternative Two-Step Approach Where a Supervised GLM Model Is Learnt on Top of a Set of Unsupervised Topics from the HDP

The top positive and negative topics for the newswire data set and their most frequent terms are given in Table 3. These topics are more cohesive than those for the movie review data set. The top positive topic contains very strong positive terms such as higher, strong, rise and record, which all imply good stock performance. The top negative topic also contains strongly negative terms such as cut, fall, decline and drop, which clearly indicate bad performance. Similarly to the top topics for the movie review data set, it can be seen that some industries consistently have better or poorer stock performance. For example, negative topic 2 consists of companies such as prudential and metlife along with terms such as insurers and insurance. The negative coefficients indicate that the insurance industry may be performing badly. Positive topic 2 with terms such as defense, military and shareholders indicates that companies involved with the military and defence are associated with rising stock prices.
The sHDP learns strong topics that are assigned to fewer words and indicate trends and tendencies at a finer-grained level, for example, on the level of actors instead of genres. The sHDP model is useful when more specific trends or tendencies are sought and when there is a possibility of overfitting or underfitting due to the number of topics.
TABLE 3
The Most Positive and Negative Learnt Topics, in Terms of Regression Coefficients, from the Newswires Data Set with sHDP

7 CONCLUSIONS
We have presented a supervised Bayesian nonparametric model that handles grouped data. Each group of data has an associated response such as sentiment ratings or document popularity. The supervised HDP model learns latent topics that are predictive of document responses without having to choose a fixed number of topics, a deficiency in previous models such as supervised LDA (sLDA). In those models, overfitting or underfitting can occur if the number of topics is unsuitable for the data set. The strongest topics learnt in the sHDP are relatively finer-grained and are associated with fewer topics, allowing their effect on the document response to be learnt easily. Regression and classification experiments were performed on real-world data sets and showed that the model performs better than sLDA on the newswire data set, and does worse than sLDA only on the movie snippet classification data set. The experiments also showed that jointly learning the topics and the GLM model produces topics and results that are better than the simple alternative of learning the topics unsupervised in a HDP model and training a regression model on top. Inference in the sHDP remains simple and is an adaptation of that used in the HDP. The flexibility and ease of inference of the sHDP means it has potential uses in many applications. Other inference techniques to improve performance can be explored such as variational inference [27]. While the sHDP does not explicitly handle categorical outcomes, extra regression parameters for each topic can be added to do so.
While sentiment analysis models such as Pang and Lee [22] have a similar goal of predicting document labels, the models we propose in this paper are more general than typical sentiment analysis models and do not require any bootstrap dictionary or labels for the terms. Our models can additionally deal with a wide range of document response types through a generalised linear model and can easily incorporate additional information into their generative process as well as use unlabelled data. The models in this paper are not restricted to textual data sets as they can be used on other kinds of data. For example, topic models have previously been used on extracted image patches or image features by treating the patches or features as words selected from a dictionary of patches [28]. Similarly, the models in this paper can be used to predict the keywords of an image or the theme of an image.

REFERENCES
[1] C. E. Antoniak, “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems,” Ann. Statist., vol. 2, no. 6, pp. 1152–1174, 1974.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[3] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. Nat. Aca. Sci. Am., vol. 101, no. Suppl. 1, pp. 5228–5235, 2004.
[4] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, “The author-topic model for authors and documents,” in Proc. 20th Conf. Uncertainty Artif. Intell., 2004, pp. 487–494.
[5] D. M. Blei and J. D. McAuliffe, “Supervised topic models,” in Advances in Neural Information Processing Systems 20, vol. 20, no. 21, J. C. Platt, D. Koller, Y. Singer, and S. Roweis, Eds., Cambridge, MA, USA: MIT Press, 2007, pp. 1–8.
[6] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[7] B. Shahbaba and R. M. Neal, “Nonlinear models using Dirichlet process mixtures,” J. Mach. Learn. Res., vol. 10, no. 10, pp. 1829–1850, Aug. 2009.
[8] L. Hannah, D. Blei, and W. Powell, “Dirichlet process mixtures of generalized linear models,” J. Mach. Learn. Res., vol. 12, pp. 1923–1953, 2011.
[9] D. Mimno and A. McCallum, “Topic models conditioned on arbitrary features with Dirichlet-multinomial regression,” in Proc. Uncertainty Artif. Intell., 2008, pp. 411–418.
[10] J. Zhu, A. Ahmed, and E. P. Xing, “MedLDA: Maximum margin supervised topic models for regression and classification,” in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 1257–1264.
[11] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in Proc. Conf. Empirical Methods Natural. Language Process., 2009, pp. 248–256.
[12] T. S. Ferguson, “A Bayesian analysis of some nonparametric problems,” Ann. Statist., vol. 1, no. 2, pp. 209–230, 1973.
[13] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1566–1581, 2006.
Andrew M. Dai received the MA degree in computer science from the University of Cambridge, and the PhD degree from the School of Informatics, University of Edinburgh (focusing on Bayesian nonparametric models). He is a software engineer at Google Inc., Mountain View. His research interests include Bayesian nonparametrics, topic modelling, Dirichlet processes, machine learning, and collaborative filtering.

Amos J. Storkey received the MA degree in mathematics, a degree in theoretical physics (Part III Maths) from Trinity, Cambridge. He received the MSc and PhD degrees (Bayesian Gaussian processes and neural networks) from Imperial College, London; both were focused on machine learning. He is a reader in the School of Informatics, University of Edinburgh. Prior to his current post, he was research associate in the School of Informatics and Institute for Astronomy, Edinburgh and a Microsoft Research funded fellowship in Informatics, Edinburgh. His research interests include methods for incentivised distributed machine learning (e.g., Machine Learning Markets), inference and learning in continuous time systems, and applications in imaging and medical informatics. He is an associate editor of the IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI).

[14] P. J. Cowans, “Information retrieval using hierarchical Dirichlet processes,” in Proc. 27th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2004, pp. 564–565.
[15] W. Li, D. Blei, and A. McCallum, “Nonparametric Bayes pachinko allocation,” in 23rd Conf. Uncertainty Artif. Intell., 2007, pp. 243–250.
[16] P. Mccullagh and J. Nelder, Generalized Linear Models, 2nd ed., series Monographs on Statistics & Applied Probability. London, U.K.: Chapman & Hall/CRC, Aug. 1989.
[17] A. M. Dai and A. J. Storkey, “The grouped author-topic model for unsupervised entity resolution,” in Proc. 21st Int. Conf. Artificial Neural Netw., 2011, no. 6791, pp. 241–249.
[18] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixture models,” J. Comput. Graphical Statist., vol. 9, no. 2, pp. 249–265, 2000.
[19] P. C. Groenewald and L. Mokgatlhe, “Bayesian computation for logistic regression,” Comput. Statist. Data Anal., vol. 48, no. 4, pp. 857–868, 2005.
[20] D. Mimno, H. Wallach, and A. McCallum, “Gibbs sampling for logistic normal topic models with graph-based priors,” in Proc. NIPS Workshop Analyzing Graphs, 2008.
[21] T. P. Minka, “A comparison of numerical optimizers for logistic regression,” Carnegie Mellon University, Tech. Rep., http://[Link]/en-us/um/people/minka/papers/logreg/[Link], 2003.
[22] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” in Proc. 43rd Annu. Meeting Assoc. Comput. Linguistics, 2005, vol. 43, no. 1, pp. 115–124.
[23] J. Chang, “lda: Collapsed Gibbs sampling methods for topic models”, R package version 1.3.2., [Link] package=lda, 2012.
[24] C. Wang, “sLDA EM implementation,” 2009, [Link]
[Link]/~chongw/slda/ with copyright at [Link]
edu/~chongw/slda/[Link]
[25] S. P. Brooks and A. Gelman, “General methods for monitoring
convergence of iterative simulations,” J. Comput. Graphical Statist.,
vol. 7, no. 4, pp. 434–455, Dec. 1998.
[26] A. Gelman and D. B. Rubin, “Inference from iterative simulation
using multiple sequences,” Statist. Sci., vol. 7, no. 4, pp. 457–472,
1992.
[27] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, “On smooth-
ing and inference for topic models,” in Proc. 25th Conf. Uncertainty
Artif. Intell., 2009, pp. 27–34.
[28] C. Wang, D. M. Blei, and F.-F. Li, “Simultaneous image classifica-
tion and annotation,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recog., 2009, pp. 1903–1910.


Common questions

What strategies are implemented in the supervised HDP model to mitigate the challenges of overfitting or underfitting, which are often encountered in predictive modeling?

The supervised HDP model does not require a predefined number of topics; the number of instantiated topics is inferred from the data and can grow or shrink with its complexity. This Bayesian nonparametric approach lets the number of clusters needed for accurate prediction emerge during inference rather than being fixed in advance. The model's improved mixing during inference also makes it easier to reallocate topics, which helps capture relevant patterns while discarding unnecessary complexity.

Discuss the application of exchangeability in topic assignments in Bayesian models and its implications for parameter inference.

Exchangeability means that the joint probability of the topic assignments or data points does not depend on their order, so inference does not depend on the sequence in which the data are presented. When maximum a posteriori (MAP) inference is used instead of a full Bayesian treatment, this symmetry can be broken, which affects the consistency of topic labels and makes the inferred parameters sensitive to the inference procedure. Starting the inference process from the generation of document content gives more consistent and meaningful topic labels, and so more reliable and interpretable parameter estimates.
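The order-invariance described above can be checked numerically on the Chinese restaurant process (CRP), the standard exchangeable partition model underlying Dirichlet process mixtures. The sketch below is illustrative only: the function name `crp_log_prob`, the table labels and the concentration value 1.5 are assumptions made for this example and do not come from the paper.

```python
import math
from itertools import permutations

def crp_log_prob(assignments, alpha):
    """Log-probability of a sequence of table labels under a CRP(alpha).

    Hypothetical helper for illustration: labels are arbitrary hashable
    values, and a label not seen before counts as opening a new table.
    """
    counts = {}
    logp = 0.0
    for n, table in enumerate(assignments):
        if table in counts:
            # Join an existing table with probability count / (n + alpha).
            logp += math.log(counts[table] / (n + alpha))
            counts[table] += 1
        else:
            # Open a new table with probability alpha / (n + alpha).
            logp += math.log(alpha / (n + alpha))
            counts[table] = 1
    return logp

seating = ["a", "a", "b", "a", "c"]
base = crp_log_prob(seating, alpha=1.5)
# Every reordering of the same customers has the same probability.
same = all(
    abs(crp_log_prob(list(p), alpha=1.5) - base) < 1e-9
    for p in permutations(seating)
)
print(same)
```

The probability depends only on the table sizes (here 3, 1 and 1), not on the order in which customers arrive, which is the property the answer refers to.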

How does the approach involving GLMs over unsupervised topics differ in predictive accuracy compared to the supervised HDP model, based on the experiments conducted in the sources?

The two-step approach first learns unsupervised topics with an HDP model and then fits a generalised linear model (GLM) on top of them. It generally yields less accurate predictions than the supervised HDP because the topics are learned without reference to the response. The supervised HDP, by contrast, learns the topics jointly with the response, so the topics are aligned with the prediction target. In the experiments, the supervised HDP outperformed the two-step approach most notably on the document popularity data set.

Evaluate the impact of the hierarchical structure in HDP models on topic inference compared to non-hierarchical models like LDA.

In a hierarchical Dirichlet process (HDP), every group draws its own topic distribution from a Dirichlet process whose base measure is itself drawn from a Dirichlet process. All groups therefore share one common set of topics, while each group's topic proportions are a perturbed version of the shared base proportions. New topics can be instantiated as new patterns appear, so the number of topics does not have to be set in advance. In contrast, latent Dirichlet allocation (LDA) requires the number of topics to be fixed beforehand, which limits its flexibility. HDP models can therefore adapt the number of topics to the data, at the cost of increased computational complexity.
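A small simulation can make the "shared base, perturbed per group" idea concrete. The sketch below uses a finite truncation of the stick-breaking construction, which is only an approximation to the HDP: global weights are drawn by truncated stick-breaking and each group's weights are drawn from a Dirichlet centred on them. The function names, the truncation level K = 8 and the concentration values are assumptions chosen for illustration, not settings from the paper.

```python
import random

def truncated_gem(gamma, K, rng):
    """Truncated stick-breaking weights; the last stick takes all remaining mass."""
    weights, remaining = [], 1.0
    for k in range(K):
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, gamma)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

def group_weights(beta, alpha0, rng):
    """One group's weights, Dirichlet(alpha0 * beta), via normalised gamma draws."""
    draws = [rng.gammavariate(alpha0 * b, 1.0) for b in beta]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
beta = truncated_gem(gamma=2.0, K=8, rng=rng)      # shared topic weights
groups = [group_weights(beta, alpha0=5.0, rng=rng)  # per-group perturbations
          for _ in range(3)]

for pi in groups:
    print([round(p, 3) for p in pi])
```

Larger values of the hypothetical `alpha0` keep each group's weights closer to the shared `beta`; smaller values let the groups differ more.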

In what ways do the supervised variations of topic models, such as sLDA and sHDP, attempt to tackle the limitations of traditional topic models in predictive tasks?

Supervised topic models, such as supervised LDA (sLDA) and the supervised hierarchical Dirichlet process (sHDP), learn topics that are relevant to a prediction task. LDA reduces dimensionality by identifying general themes in the corpus, whereas sLDA also uses each document's response while learning the topics, so the topics are aligned with the prediction target. However, sLDA still requires a fixed number of topics. The sHDP goes further by taking a Bayesian nonparametric approach: it allows a potentially infinite number of latent clusters and infers the number of topics needed to predict the document responses. This flexibility helps mitigate the overfitting or underfitting that can arise when the number of topics is fixed in advance.

What are the benefits and drawbacks of adopting a hierarchical Bayesian approach, such as the HDP, over methods that require predefined structures for unsupervised learning?

The main benefit of a hierarchical Bayesian approach such as the HDP is that it models data with a potentially infinite number of topics or clusters, so the model adjusts to the underlying complexity of the data without a predefined number of topics. This reduces the risk of overfitting and underfitting associated with fixed-structure methods. The drawbacks are increased computational complexity and the difficulty of choosing the prior distributions. Because they are more flexible, hierarchical models may also need more data for accurate inference than simpler models.

How does the supervised HDP model improve upon the predictive performance in supervised learning tasks compared to sLDA, particularly with noisy data sets?

The supervised hierarchical Dirichlet process (sHDP) can improve on supervised latent Dirichlet allocation (sLDA) because its topic structure adapts to the complexity of the data rather than being fixed in size. This is particularly useful for noisy data sets, since the sHDP can instantiate or remove topics as needed. For example, on the newswire data set, which is noisier because news affects stock movements only indirectly, the sHDP performs better than sLDA, which is restricted to a fixed number of topics.

Explain the concept of the canonical link function in the context of GLMs and its significance.

In generalised linear models (GLMs), the canonical link function is the link that makes the linear predictor equal to the natural parameter of the exponential family distribution of the response. Its significance lies in mathematical convenience and interpretability: it simplifies the computation of maximum likelihood estimates of the model parameters. For example, the canonical logit link in logistic regression makes the log odds of success a linear function of the predictors, which keeps model fitting and interpretation straightforward.
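As a worked illustration of the statement above, the Bernoulli case can be written out explicitly. The symbols here ($\mu$ for the mean, $\theta$ for the natural parameter, $\mathbf{x}$ for the predictors and $\mathbf{w}$ for the coefficients) are chosen for this example and are not the paper's notation.

```latex
% Bernoulli likelihood in exponential-family form
p(y \mid \mu) = \mu^{y} (1 - \mu)^{1 - y}
             = \exp\!\left\{ y \log\frac{\mu}{1 - \mu} + \log(1 - \mu) \right\},
\qquad y \in \{0, 1\}.

% The natural parameter is the log odds, so the canonical link is the logit
\theta = \log\frac{\mu}{1 - \mu},
\qquad
g(\mu) = \operatorname{logit}(\mu) = \mathbf{x}^{\top}\mathbf{w}
\;\Longleftrightarrow\;
\mu = \sigma(\mathbf{x}^{\top}\mathbf{w}) = \frac{1}{1 + e^{-\mathbf{x}^{\top}\mathbf{w}}}.

% With the canonical link, the score for one observation takes a simple form
\nabla_{\mathbf{w}} \log p(y \mid \mathbf{x}, \mathbf{w}) = (y - \mu)\,\mathbf{x}.
```

The simple form of the last line, residual times predictor, is the computational convenience that the canonical link provides.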

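How does the Hierarchical Dirichlet Process (HDP) address the limitations of the Latent Dirichlet Allocation (LDA) model in terms of topic numbers, and what implications does this have for overfitting or underfitting issues?

The hierarchical Dirichlet process (HDP) extends latent Dirichlet allocation (LDA) by allowing a countably infinite number of topics, of which only a finite number are used by any finite data set; that number can grow or shrink according to the data. This removes the need to choose the number of topics, as LDA requires, and reduces overfitting or underfitting because the number of topics is not fixed but adapts to the complexity of the data.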
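How the number of occupied clusters grows with the amount of data can be seen in a short simulation of the Chinese restaurant process, the sequential view of a single Dirichlet process. This is a minimal sketch under stated assumptions: `crp_table_counts`, the concentration value 2.0 and the sample size are illustrative choices, and the simulation covers one Dirichlet process rather than the full HDP.

```python
import random

def crp_table_counts(n_customers, alpha, rng):
    """Seat customers one at a time; return the number of occupied tables after each."""
    sizes = []      # sizes[k] = number of customers at table k
    history = []
    for n in range(n_customers):
        # New table with probability alpha / (n + alpha),
        # existing table k with probability sizes[k] / (n + alpha).
        r = rng.random() * (n + alpha)
        if r < alpha:
            sizes.append(1)
        else:
            r -= alpha
            for k, size in enumerate(sizes):
                if r < size:
                    sizes[k] += 1
                    break
                r -= size
            else:
                sizes[-1] += 1  # guard against floating-point rounding
        history.append(len(sizes))
    return history

rng = random.Random(1)
history = crp_table_counts(1000, alpha=2.0, rng=rng)
print(history[9], history[99], history[999])
```

The count of occupied tables never decreases as customers arrive and grows slowly, roughly logarithmically in the number of customers, so the model's capacity expands with the data instead of being fixed beforehand.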

Discuss how generalised linear models (GLMs) provide flexibility in statistical modeling and mention a scenario where a Gaussian assumption might be inappropriate.

Generalised linear models (GLMs) extend linear regression to responses that are not normally distributed or whose mean is not a linear function of the predictors. A GLM has three parts: a linear predictor, a link function relating the linear predictor to the mean response, and an exponential family distribution for the response. Logistic regression, for example, combines a binomial distribution with a logit link to model binary outcomes. A Gaussian assumption is inappropriate for such outcomes, for instance yes/no responses in a survey, because the normal distribution cannot capture the discrete nature of the data.
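The three GLM components named above can be seen in a few lines of code. The following is a minimal sketch of logistic regression with one predictor, fitted by plain gradient ascent on the mean log-likelihood; the toy data, learning rate and step count are assumptions made for illustration, and this is not the inference procedure used in the paper.

```python
import math

def sigmoid(t):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit slope w and intercept b of a Bernoulli GLM with a logit link."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Linear predictor -> inverse link -> mean of the Bernoulli response.
        mus = [sigmoid(w * x + b) for x in xs]
        # Gradient of the mean log-likelihood: residual times predictor.
        grad_w = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus)) / n
        grad_b = sum(y - mu for y, mu in zip(ys, mus)) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Toy binary responses (illustrative data, not from the paper).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic(xs, ys)
probs = [sigmoid(w * x + b) for x in xs]
print([round(p, 2) for p in probs])
```

The fitted probabilities stay strictly between 0 and 1, which a Gaussian model with an identity link would not guarantee for a yes/no response.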



