Results 1  10
of
253
Hierarchical Dirichlet processes.
 Journal of the American Statistical Association,
, 2006
"... We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this s ..."
Abstract

Cited by 942 (78 self)
 Add to MetaCart
(Show Context)
We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the wellknown clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stickbreaking process, and a generalization of the Chinese restaurant process that we refer to as the "Chinese restaurant franchise." We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures and describe applications to problems in information retrieval and text modeling.
The Infinite Hidden Markov Model
 Machine Learning
, 2002
"... We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. Th ..."
Abstract

Cited by 637 (41 self)
 Add to MetaCart
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying statetransition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infiniteconsider, for example, symbols being possible words appearing in English text.
A family of algorithms for approximate Bayesian inference
, 2001
"... One of the major obstacles to using Bayesian methods for pattern recognition has been its computational expense. This thesis presents an approximation technique that can perform Bayesian inference faster and more accurately than previously possible. This method, "Expectation Propagation," ..."
Abstract

Cited by 366 (11 self)
 Add to MetaCart
One of the major obstacles to using Bayesian methods for pattern recognition has been its computational expense. This thesis presents an approximation technique that can perform Bayesian inference faster and more accurately than previously possible. This method, "Expectation Propagation," unifies and generalizes two previous techniques: assumeddensity filtering, an extension of the Kalman filter, and loopy belief propagation, an extension of belief propagation in Bayesian networks. The unification shows how both of these algorithms can be viewed as approximating the true posterior distribution with a simpler distribution, which is close in the sense of KLdivergence. Expectation Propagation exploits the best of both algorithms: the generality of assumeddensity filtering and the accuracy of loopy belief propagation. Loopy belief propagation, because it propagates exact belief states, is useful for limited types of belief networks, such as purely discrete networks. Expectation Propagati...
Computational Discovery of Gene Modules, Regulatory Networks and Expression Programs
, 2007
"... Highthroughput molecular data are revolutionizing biology by providing massive amounts of information about gene expression and regulation. Such information is applicable both to furthering our understanding of fundamental biology and to developing new diagnostic and treatment approaches for diseas ..."
Abstract

Cited by 236 (17 self)
 Add to MetaCart
Highthroughput molecular data are revolutionizing biology by providing massive amounts of information about gene expression and regulation. Such information is applicable both to furthering our understanding of fundamental biology and to developing new diagnostic and treatment approaches for diseases. However, novel mathematical methods are needed for extracting biological knowledge from highdimensional, complex and noisy data sources. In this thesis, I develop and apply three novel computational approaches for this task. The common theme of these approaches is that they seek to discover meaningful groups of genes, which confer robustness to noise and compress complex information into interpretable models. I first present the GRAM algorithm, which fuses information from genomewide expression and in vivo transcription factorDNA binding data to discover regulatory networks of
Variational Inference for Bayesian Mixtures of Factor Analysers
 In Advances in Neural Information Processing Systems 12
, 2000
"... We present an algorithm that infers the model structure of a mixture of factor analysers using an ecient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimension ..."
Abstract

Cited by 191 (22 self)
 Add to MetaCart
(Show Context)
We present an algorithm that infers the model structure of a mixture of factor analysers using an ecient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out the method is not prone to over tting. Using a stochastic procedure for adding components it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples. By importance sampling from the variational approximation we show how to obtain unbiased estimates of the true evidence, the exa...
A bayesian framework for word segmentation: Exploring the effects of context
 In 46th Annual Meeting of the ACL
, 2009
"... Since the experiments of Saffran et al. (1996a), there has been a great deal of interest in the question of how statistical regularities in the speech stream might be used by infants to begin to identify individual words. In this work, we use computational modeling to explore the effects of differen ..."
Abstract

Cited by 110 (30 self)
 Add to MetaCart
Since the experiments of Saffran et al. (1996a), there has been a great deal of interest in the question of how statistical regularities in the speech stream might be used by infants to begin to identify individual words. In this work, we use computational modeling to explore the effects of different assumptions the learner might make regarding the nature of words – in particular, how these assumptions affect the kinds of words that are segmented from a corpus of transcribed childdirected speech. We develop several models within a Bayesian ideal observer framework, and use them to examine the consequences of assuming either that words are independent units, or units that help to predict other units. We show through empirical and theoretical results that the assumption of independence causes the learner to undersegment the corpus, with many two and threeword sequences (e.g. what’s that, do you, in the house) misidentified as individual words. In contrast, when the learner assumes that words are predictive, the resulting segmentation is far more accurate. These results indicate that taking context into account is important for a statistical word segmentation strategy to be successful, and raise the possibility that even young infants may be able to exploit more subtle statistical patterns than have usually been considered. 1
Bayesian hierarchical clustering
 In Proceedings of the 22nd International Conference on Machine Learning. ACM
, 2005
"... We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distancebased agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which ..."
Abstract

Cited by 72 (11 self)
 Add to MetaCart
(Show Context)
We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distancebased agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which can be used to compute the predictive distribution of a test point and the probability of it belonging to any of the existing clusters in the tree. (2) It uses a modelbased criterion to decide on merging clusters rather than an adhoc distance metric. (3) Bayesian hypothesis testing is used to decide which merges are advantageous and to output the recommended depth of the tree. (4) The algorithm can be interpreted as a novel fast bottomup approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). It provides a new lower bound on the marginal likelihood of a DPM by summing over exponentially many clusterings of the data in polynomial time. We describe procedures for learning the model hyperparameters, computing the predictive distribution, and extensions to the algorithm. Experimental results on synthetic and realworld data sets demonstrate useful properties of the algorithm. 1.
Rational approximations to rational models: Alternative algorithms for category learning
"... Rational models of cognition typically consider the abstract computational problems posed by the environment, assuming that people are capable of optimally solving those problems. This differs from more traditional formal models of cognition, which focus on the psychological processes responsible fo ..."
Abstract

Cited by 61 (19 self)
 Add to MetaCart
(Show Context)
Rational models of cognition typically consider the abstract computational problems posed by the environment, assuming that people are capable of optimally solving those problems. This differs from more traditional formal models of cognition, which focus on the psychological processes responsible for behavior. A basic challenge for rational models is thus explaining how optimal solutions can be approximated by psychological processes. We outline a general strategy for answering this question, namely to explore the psychological plausibility of approximation algorithms developed in computer science and statistics. In particular, we argue that Monte Carlo methods provide a source of “rational process models” that connect optimal solutions to psychological processes. We support this argument through a detailed example, applying this approach to Anderson’s (1990, 1991) Rational Model of Categorization (RMC), which involves a particularly challenging computational problem. Drawing on a connection between the RMC and ideas from nonparametric Bayesian statistics, we propose two alternative algorithms for approximate inference in this model. The algorithms we consider include Gibbs sampling, a procedure