## The nested Chinese restaurant process and Bayesian inference of topic hierarchies (2007)

Citations: 55 (9 self)

### BibTeX

@MISC{Blei07thenested,

author = {David M. Blei and Thomas L. Griffiths and Michael I. Jordan},

title = {The nested {C}hinese restaurant process and {B}ayesian inference of topic hierarchies},

year = {2007}

}

### Abstract

We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitely-deep, infinitely-branching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP leads to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning—the use of Bayesian nonparametric methods to infer distributions on flexible data structures.
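The generative process sketched in the abstract — each document follows a path down a tree, choosing a branch at every level by a Chinese restaurant process — can be illustrated with a small simulation. This is not the authors' code; it assumes a single concentration parameter `gamma` shared by every node and a fixed path depth, both simplifying assumptions of this sketch.

```python
import random

def crp_next_table(counts, gamma, rng=random):
    """Sample a table index from a Chinese restaurant process.

    counts: customers already seated at each existing table.
    gamma: concentration parameter; larger values favor new tables.
    Returns an index into counts, or len(counts) for a new table.
    """
    r = rng.uniform(0, sum(counts) + gamma)
    for i, c in enumerate(counts):
        r -= c
        if r < 0:
            return i
    return len(counts)  # start a new table

def ncrp_sample_path(tree, depth, gamma, rng=random):
    """Seat one document along a path of length `depth` in a nested CRP.

    `tree` maps a path prefix (tuple of branch choices) to the list of
    per-child customer counts at that node; it is updated in place.
    """
    path = []
    for _ in range(depth):
        counts = tree.setdefault(tuple(path), [])
        choice = crp_next_table(counts, gamma, rng)
        if choice == len(counts):
            counts.append(0)  # a previously unvisited branch
        counts[choice] += 1
        path.append(choice)
    return tuple(path)
```

Repeatedly seating documents this way produces the preferential-attachment clustering the abstract describes: popular subtrees attract more paths.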

### Citations

3737 | Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
- Geman, Geman
- 1984
Citation Context: ...d the target distribution is the conditional distribution of these latent variables given the observed data. The particular MCMC algorithm that we present in this paper is a Gibbs sampling algorithm [Geman and Geman 1984; Gelfand and Smith 1990]. In a Gibbs sampler each latent variable is iteratively sampled conditioned on the observations and all the other latent variables. We employ collapsed Gibbs sampling [Liu 19...
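The scheme quoted in this context — each latent variable resampled from its conditional given the observations and all other latent variables — can be demonstrated on a toy target. The sketch below uses a standard bivariate normal with correlation `rho` (chosen here purely for illustration), not the paper's collapsed sampler for hLDA.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, rng=random):
    """Toy Gibbs sampler: each coordinate is drawn from its exact
    conditional given the other, so the chain's stationary
    distribution is a standard bivariate normal with correlation rho.
    """
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x = y = 0.0
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)  # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)  # y | x ~ N(rho*x, 1 - rho^2)
        samples.append((x, y))
    return samples
```

Collapsed Gibbs sampling, as used in the paper, differs in that some latent variables are marginalized out analytically before the remaining ones are sampled.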

3038 | Probability and Measure
- Billingsley
- 1995
Citation Context: ... Moreover, if we consider all possible finite partitions of Ω, the resulting Dirichlet distributions are consistent with each other. This suggests, by an appeal to the Kolmogorov consistency theorem [Billingsley 1995], that we can view G as a draw from an underlying stochastic process, where the index set is the set of Borel sets of Ω. Although this naive appeal to the Kolmogorov consistency theorem runs aground ...

2730 | Indexing by Latent Semantic Analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context: ...arly work on topic modeling derived from latent semantic analysis, an application of the singular value decomposition in which “topics” are viewed post hoc as the basis of a low-dimensional subspace [Deerwester et al. 1990]. Subsequent work treated topics as probability distributions over words and used likelihood-based methods to estimate these distributions from a corpus [Hofmann 1999b]. In both of these approaches, ...

2392 | Latent Dirichlet allocation
- Blei, Ng, et al.
- 2002
Citation Context: ...ds can be deployed in a specific unsupervised machine learning problem of significant current interest—that of learning topic models for collections of text, images and other semi-structured corpora [Blei et al. 2003; Griffiths and Steyvers 2006; Blei and Lafferty 2009]. Let us briefly introduce the problem here; a more formal presentation appears in Section 4. A topic is defined to be a probability distribution ...

2102 | Emergence of scaling in random networks
- Barabási, Albert
- 1999
Citation Context: ...ortant aspects of the CRP, in particular the “preferential attachment” dynamics that are built into Eq. (1). Probability structures of this form have been used as models in a variety of applications [Barabasi and Reka 1999; Krapivsky and Redner 2001; Albert and Barabasi 2002; Drinea et al. 2006], and the clustering that they induce makes them a reasonable starting place for a hierarchical topic model. In fact, these tw...
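Eq. (1) of the paper is not reproduced in this listing, but the standard CRP seating rule it refers to is easy to state: an existing table is chosen with probability proportional to its occupancy, and a new table with probability proportional to a concentration parameter γ. A minimal sketch:

```python
def crp_seating_probs(counts, gamma):
    """Seating probabilities for the next customer in a CRP.

    counts: customers already at each table; gamma: concentration.
    Returns one probability per existing table, plus a final entry
    for starting a new table. Crowded tables attract more customers,
    which is the "preferential attachment" dynamic quoted above.
    """
    denom = sum(counts) + gamma
    return [c / denom for c in counts] + [gamma / denom]
```

For example, with tables of size 3 and 1 and γ = 1, the next customer joins them with probabilities 3/5, 1/5, and starts a new table with probability 1/5.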

2055 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...such as decision trees, boosting and nearest neighbor methods are nonparametric, as are the class of supervised learning systems built on “kernel methods,” including the support vector machine. (See [Hastie et al. 2001] for a good review of these methods.) Theoretical developments in supervised learning have shown that as the number of data points grows, these methods can converge to the true labeling function unde...

1250 | Bayesian Data Analysis - Gelman, Carlin, et al. - 1995

1202 | Statistical mechanics of complex networks
- Albert, Barabási
Citation Context: ...ortant aspects of the CRP, in particular the “preferential attachment” dynamics that are built into Eq. (1). Probability structures of this form have been used as models in a variety of applications (Albert and Barabasi, 2002; Barabasi and Reka, 1999; Krapivsky and Redner, 2001; Drinea et al., 2006), and the clustering that they induce makes them a reasonable starting place for a hierarchical topic model. In fact, these t...

1099 | Graphical Models
- Lauritzen
- 1996
Citation Context: ... data may have been generated. Probabilistic graphical models (also known as “Bayesian networks” and “Markov random fields”) have emerged as a broadly useful approach to specifying generative models [Lauritzen 1996; Jordan 2000]. The elegant marriage of graph theory and probability theory in graphical models makes it possible to take a fully probabilistic (i.e., Bayesian) approach to unsupervised learning in wh...

1042 | Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context: ... contain parameters (“hyper-hyperparameters”), but the resulting inferences are less influenced by these hyper-hyperparameters than they are by fixing the original hyperparameters to specific values [Bernardo and Smith 1994]. To incorporate this extension into the Gibbs sampler, we interleave Metropolis-Hastings (MH) steps between iterations of the Gibbs sampler to obtain new values of m, π, γ, and η. This preserves the ...

1001 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, et al. - 1996

988 | Bayes factors
- Kass, Raftery
- 1995
Citation Context: ...r each of these outer samples, we collect 800 samples of the latent variables given the held-out documents and approximate their conditional probability given the outer sample with the harmonic mean [Kass and Raftery 1995]. Finally, these conditional probabilities are averaged to obtain an approximation to Eq. (7). Figure 9 illustrates the five-fold cross-validated held-out likelihood for hLDA and LDA on the JACM corp...
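The harmonic-mean estimator mentioned in this context approximates a marginal likelihood from S posterior samples as S / Σ_s 1/p(w | z_s). A numerically stable log-space sketch of the estimator (an illustration of the general technique, not the authors' code; the per-sample log-likelihoods are the only input assumed):

```python
import math

def harmonic_mean_log_likelihood(log_likelihoods):
    """Harmonic-mean estimate of a marginal likelihood, in log space.

    log_likelihoods: log p(w | z_s) for each posterior sample s.
    Returns log( S / sum_s exp(-log_lik_s) ), computed with a
    log-sum-exp shift to avoid overflow.
    """
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(log_likelihoods)) - log_sum
```

Note the estimator is known to have high variance (a caveat discussed by Kass and Raftery themselves), which is one reason several outer samples are averaged in the quoted procedure.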

909 | Monte Carlo Statistical Methods
- Robert, Casella
- 1999
Citation Context: ...proximate the posterior for hLDA. In MCMC, one samples from a target distribution on a set of variables by constructing a Markov chain that has the target distribution as its stationary distribution [Robert and Casella 2004]. One then samples from the chain for sufficiently long that it approaches the target, collects the sampled states...

792 | Probabilistic latent semantic indexing
- Hofmann
- 1999
Citation Context: ...-dimensional subspace [Deerwester et al. 1990]. Subsequent work treated topics as probability distributions over words and used likelihood-based methods to estimate these distributions from a corpus [Hofmann 1999b]. In both of these approaches, the interpretation of “topic” differs in key ways from the clustering metaph...

714 | A Bayesian analysis of some nonparametric problems
- Ferguson
- 1973
Citation Context: ...ets of Ω. Although this naive appeal to the Kolmogorov consistency theorem runs aground on measure-theoretic difficulties, the basic idea is correct and can be made rigorous via a different approach [Ferguson 1973]. The resulting stochastic process is known as the Dirichlet process. Note that if we truncate the stick-breaking process after L−1 breaks, we obtain a Dirichlet distribution on an L-dimensional vect...
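The truncation mentioned in this context — stopping the stick-breaking construction after L−1 breaks so that L weights remain — can be sketched as follows. `alpha` here denotes the concentration parameter of the Beta(1, α) break proportions, an assumption of this illustration rather than notation taken from the listing.

```python
import random

def truncated_stick_breaking(alpha, L, rng=random):
    """Draw weights from a stick-breaking process truncated at L pieces.

    Each break removes a Beta(1, alpha) fraction of the remaining
    stick; the final piece takes whatever stick is left, so the L
    weights are nonnegative and sum to one.
    """
    weights, remaining = [], 1.0
    for _ in range(L - 1):
        b = rng.betavariate(1.0, alpha)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    weights.append(remaining)  # the unbroken remainder
    return weights
```

Without the truncation (letting L grow without bound) the same construction yields the weights of a draw from a Dirichlet process.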

630 | Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
Citation Context: ... This induces a probabilistic clustering of the generated data because customers sitting around each table share the same parameter vector. This model is in the spirit of a traditional mixture model [Titterington et al. 1985], but is critically different in that the number of tables is unbounded. Data analysis amounts to inverting the generative process to determine a probability distribution on the “seating assignment” ...

628 | Finding scientific topics
- Griffiths, Steyvers
- 2004
Citation Context: ...variables. We employ collapsed Gibbs sampling [Liu 1994], in which we marginalize out some of the latent variables to speed up the convergence of the chain. Collapsed Gibbs sampling for topic models [Griffiths and Steyvers 2004] has been widely used in a number of topic modeling applications [McCallum et al. 2004; Rosen-Zvi et al. 2004; Mimno and McCallum 2007; Dietz et al. 2007; Newman et al. 2006]. In hLDA, we sample the ...

614 | Learning in graphical models
- Jordan, editor
- 1998
Citation Context: ...een generated. Probabilistic graphical models (also known as “Bayesian networks” and “Markov random fields”) have emerged as a broadly useful approach to specifying generative models [Lauritzen 1996; Jordan 2000]. The elegant marriage of graph theory and probability theory in graphical models makes it possible to take a fully probabilistic (i.e., Bayesian) approach to unsupervised learning in which efficient...

549 | A Bayesian hierarchical model for learning natural scene categories
- Fei-Fei, Perona
- 2005
Citation Context: ...n of image features. Indeed, topic models that are special cases of the model presented here, i.e., with no hierarchy and with a fixed number of topics, have appeared in the recent vision literature (Fei-Fei and Perona, 2005; Russell et al., 2006). In a genomics setting, “documents” might be sets of DNA sequences, and topics will reflect regulatory motifs. In a collaborative filtering setting, “documents” might be the pu...

542 | Hierarchical Dirichlet processes
- Teh, Jordan, et al.
Citation Context: ...s is a fixed parameter, and a model selection procedure is required to choose the number of topics. (A Bayesian nonparametric solution to this can be obtained with the hierarchical Dirichlet process [Teh et al. 2007].) Second, given a set of topics, LDA places no constraints on the usage of the topics by documents in the corpus; a document can place an arbitrary probability distribution on the topics. In hLDA, o...

513 | A new approach to the maximum-flow problem
- Goldberg, Tarjan
- 1988
Citation Context: ... Fig. 4. A single state of the Markov chain in the Gibbs sampler for the abstract of “A new approach to the maximum-flow problem” [Goldberg and Tarjan, 1986]. The document is associated with a path through the hierarchy cd, and each node in the hierarchy is associated with a distribution over terms. (The five most probable terms are illustrated.) Finall...

421 | Hierarchically classifying documents using very few words
- Koller, Sahami
- 1997
Citation Context: ... of methods that employ hierarchies in analyzing text data. In one line of work, the algorithms are given a hierarchy of document categories, and their goal is to correctly place documents within it [Koller and Sahami 1997; Chakrabarti et al. 1998; McCallum et al. 1999; Dumais and Chen 2000]. Other work has focused on deriving hierarchies of individual terms using side information, such as a grammar or a thesaurus, tha...

418 | Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems
- Antoniak
- 1974
Citation Context: ... stochastic process theory and Bayesian nonparametric statistics, specifically the Chinese restaurant process [Aldous 1985], stick-breaking processes [Pitman 2002], and the Dirichlet process mixture [Antoniak 1974]. In this section we briefly review these ideas and the connections between them. 2.1 Dirichlet and beta distributions Recall that the Dirichlet distribution is a probability distribution on the simp...

416 | Inference of population structure using multilocus genotype data: dominant markers and null alleles
- Falush, Stephens, et al.
- 2007
Citation Context: ... readily applied to biological data sets, purchasing data, collections of images, or social network data. (Note that applications in such domains have already been demonstrated for flat topic models [Pritchard et al. 2000; Marlin 2003; Fei-Fei and Perona 2005; Blei and Jordan 2003; Airoldi et al. 2008].) Finally, as a Bayesian nonparametric model, our approach can accommodate future data that might lie in new and prev...

400 | Bayesian density estimation and inference using mixtures
- Escobar, West
- 1995
Citation Context: ...., the distribution is invariant to the order of the arrival of customers [Pitman 2002]. This exchangeability property makes CRP-based models amenable to posterior inference using Monte Carlo methods [Escobar and West 1995; MacEachern and Muller 1998; Neal 2000]. The nCRP is closely related to a stochastic process known as the nested Dirichlet process (nDP), which has been proposed independently of our work by [Rodrígu...

394 | Dynamic topic models
- Blei, Lafferty
- 2006
Citation Context: ...achine learning problem of significant current interest—that of learning topic models for collections of text, images and other semi-structured corpora [Blei et al. 2003; Griffiths and Steyvers 2006; Blei and Lafferty 2009]. Let us briefly introduce the problem here; a more formal presentation appears in Section 4. A topic is defined to be a probability distribution across words from a vocabulary. Given an input corpus...

387 | Learning to classify text using support vector machines
- Joachims
- 2002
Citation Context: ...d collection of documents. This approach is to be contrasted with a supervised learning approach in which it is assumed that each document is labeled according to one or more topics defined a priori (Joachims, 2002). In this work, we do not assume that hand-labeled topics are available; indeed, we do not assume that the number of topics or their hierarchical structure is known a priori. By defining a probabilis...

375 | Markov Chain Sampling Methods for Dirichlet Process Mixture Models
- Neal
- 2000
Citation Context: ...rival of customers (Pitman, 2002). This exchangeability property makes CRP-based models amenable to posterior inference using Monte Carlo methods (Escobar and West, 1995; MacEachern and Muller, 1998; Neal, 2000), when the exchangeable partition distribution is combined with a data generating distribution. 4. HIERARCHICAL LATENT DIRICHLET ALLOCATION The nested CRP provides a way to define a prior on tree top...

373 | Sampling-based approaches to calculating marginal densities
- Gelfand, Smith
- 1990
Citation Context: ...ion is the conditional distribution of these latent variables given the observed data. The particular MCMC algorithm that we present in this paper is a Gibbs sampling algorithm [Geman and Geman 1984; Gelfand and Smith 1990]. In a Gibbs sampler each latent variable is iteratively sampled conditioned on the observations and all the other latent variables. We employ collapsed Gibbs sampling [Liu 1994], in which we margina...

333 | Web document clustering: A feasibility demonstration - Zamir, Etzioni - 1998

331 | Modeling annotated data - Blei, Jordan - 2003

314 | A Constructive Definition of Dirichlet Priors - Sethuraman - 1994

267 | Hierarchical classification of Web content - Dumais, Chen

266 | Ferguson distributions via Pólya urn schemes
- Blackwell, MacQueen
- 1973
Citation Context: ...ed Dirichlet process (nDP), which has been proposed independently of our work by [Rodríguez et al. 2008]. Indeed, just as the CRP can be obtained by integrating out the Dirichlet process [Blackwell and MacQueen 1973], a K-level nCRP can be obtained by integrating out the Dirichlet processes in a K-level nDP. 4. HIERARCHICAL LATENT DIRICHLET ALLOCATION The nested CRP provides a way to define a prior on tree topol...

237 | The author-topic model for authors and documents
- Rosen-Zvi, Griffiths, et al.
- 2004
Citation Context: ... model the documents are time stamped and the underlying topics change over time [Blei and Lafferty 2006]; in the author-topic model the authorship of the documents affects which topics they exhibit [Rosen-Zvi et al. 2004]. This said, some extensions are more easily adaptable than others. In the correlated topic model, the topic proportions exhibit a covariance structure [Blei and Lafferty 2007]. This is achieved by r...

228 | Deriving concept hierarchies from text
- Sanderson, Croft
- 1999
Citation Context: ...999; Dumais and Chen 2000]. Other work has focused on deriving hierarchies of individual terms using side information, such as a grammar or a thesaurus, that are sometimes available for text domains [Sanderson and Croft 1999; Stoica and Hearst 2004; Cimiano et al. 2005]. Our method provides still another way to employ a notion of hierarchy in text analysis. First, rather than learn a hierarchy of terms we learn a hierarc...

190 | Hierarchical topic models and the nested Chinese restaurant process
- Blei, Griffiths, et al.
Citation Context: ...ds can be deployed in a specific unsupervised machine learning problem of significant current interest—that of learning topic models for collections of text, images and other semi-structured corpora [Blei et al. 2003; Griffiths and Steyvers 2006; Blei and Lafferty 2009]. Let us briefly introduce the problem here; a more formal presentation appears in Section 4. A topic is defined to be a probability distribution ...

185 | Infinite latent feature models and the Indian buffet process
- Griffiths, Ghahramani
- 2006
Citation Context: ...tation. This is a growing theme in Bayesian nonparametric research. For example, one line of recent research has explored stochastic processes involving multiple binary features rather than clusters [Griffiths and Ghahramani 2006; Thibaux and Jordan 2007; Teh et al. 2007]. A parallel line of investigation has explored alternative posterior inference techniques for Bayesian nonparametric models, providing more efficient algori...

178 | Mixed membership stochastic blockmodels
- Airoldi, Blei, et al.
Citation Context: ...r social network data. (Note that applications in such domains have already been demonstrated for flat topic models [Pritchard et al. 2000; Marlin 2003; Fei-Fei and Perona 2005; Blei and Jordan 2003; Airoldi et al. 2008].) Finally, as a Bayesian nonparametric model, our approach can accommodate future data that might lie in new and previously undiscovered parts of the tree. Previous work commits to a single fixed tr...

148 | Fast and effective text mining using linear-time document clustering (fifth ACM SIGKDD international conference on Knowledge Discovery and Data Mining) - Larsen, Aone - 1999

143 | Estimating Mixture of Dirichlet Process Models
- MacEachern, Müller
- 1998
Citation Context: ...invariant to the order of the arrival of customers [Pitman 2002]. This exchangeability property makes CRP-based models amenable to posterior inference using Monte Carlo methods [Escobar and West 1995; MacEachern and Muller 1998; Neal 2000]. The nCRP is closely related to a stochastic process known as the nested Dirichlet process (nDP), which has been proposed independently of our work by [Rodríguez et al. 2008]. Indeed, jus...

130 | Variational inference for Dirichlet process mixtures
- Blei, Jordan
- 2005
Citation Context: ...cally, variational methods, which replace sampling with optimization, have been developed for Dirichlet process mixtures to further increase their applicability to large-scale data analysis problems [Blei and Jordan 2005; Kurihara et al. 2007]. The hierarchical topic model that we explored in this paper is just one example of how this synthesis of statistics and computer science can produce powerful new tools for the...

127 | Exchangeability and related topics
- Aldous
- 1985
Citation Context: ...n 7. 2. BACKGROUND Our approach to topic modeling reposes on several building blocks from stochastic process theory and Bayesian nonparametric statistics, specifically the Chinese restaurant process [Aldous 1985], stick-breaking processes [Pitman 2002], and the Dirichlet process mixture [Antoniak 1974]. In this section we briefly review these ideas and the connections between them. 2.1 Dirichlet and beta dis...

120 | Connectivity of growing random networks
- Krapivsky, Redner, et al.
- 2000
Citation Context: ...P, in particular the “preferential attachment” dynamics that are built into Eq. (1). Probability structures of this form have been used as models in a variety of applications [Barabasi and Reka 1999; Krapivsky and Redner 2001; Albert and Barabasi 2002; Drinea et al. 2006], and the clustering that they induce makes them a reasonable starting place for a hierarchical topic model. In fact, these two points are intimately rel...

115 | Monte Carlo Statistical Methods (Springer Texts in Statistics)
- Robert, Casella
- 2005
Citation Context: ...proximate the posterior for hLDA. In MCMC, one samples from a target distribution on a set of variables by constructing a Markov chain that has the target distribution as its stationary distribution (Robert and Casella, 2004). One then samples from the chain for sufficiently long that it approaches the target, collects the sampled states thereafter, and uses those collected states to estimate the target. This approach is...

114 | Probabilistic topic models
- Steyvers, Griffiths
- 2006
Citation Context: ... to competing unigram language models, and it has also been argued that the topic-based analysis provided by LDA represents a qualitative improvement on competing language models (Blei et al., 2003b; Griffiths and Steyvers, 2006). Thus LDA provides a natural point of comparison. On the other hand, there are several issues that must be borne in mind in comparing hLDA to LDA. First, in LDA the number of topics is a fixed param...

106 | Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies
- Chakrabarti, Dom, et al.
- 1998
Citation Context: ...hierarchies in analyzing text data. In one line of work, the algorithms are given a hierarchy of document categories, and their goal is to correctly place documents within it [Koller and Sahami 1997; Chakrabarti et al. 1998; McCallum et al. 1999; Dumais and Chen 2000]. Other work has focused on deriving hierarchies of individual terms using side information, such as a grammar or a thesaurus, that are sometimes available...

99 | Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis
- Cimiano, Hotho, et al.
Citation Context: ... on deriving hierarchies of individual terms using side information, such as a grammar or a thesaurus, that are sometimes available for text domains [Sanderson and Croft 1999; Stoica and Hearst 2004; Cimiano et al. 2005]. Our method provides still another way to employ a notion of hierarchy in text analysis. First, rather than learn a hierarchy of terms we learn a hierarchy of topics, where a topic is a distribution...

85 | A correlated topic model of Science
- Blei, Lafferty
- 2007
Citation Context: ...ich topics they exhibit [Rosen-Zvi et al. 2004]. This said, some extensions are more easily adaptable than others. In the correlated topic model, the topic proportions exhibit a covariance structure [Blei and Lafferty 2007]. This is achieved by replacing a Dirichlet distribution with a logistic normal, and the application of Bayesian nonparametric extensions is less direct...

80 | Urn models and their application: An approach to modern discrete probability theory (Wiley Series in Probability and Mathematical Statistics)
- Johnson, Kotz
- 1977
Citation Context: ...distribution G0. Each customer is associated with the parameter vector at the table at which he sits. The resulting distribution on sequences of parameter vectors is referred to as a Pólya urn model [Johnson and Kotz 1977]. The Pólya urn distribution can be used to define a flexible clustering model. Let the parameters at the tables index a family of probability distributions (for example, the distribution might be a ...