Results 1–10 of 77
Hierarchical Dirichlet processes
Journal of the American Statistical Association, 2004
"... program. The authors wish to acknowledge helpful discussions with Lancelot James and Jim Pitman and the referees for useful comments. 1 We consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture comp ..."
Abstract

Cited by 536 (55 self)
 Add to MetaCart
We consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of …
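The hierarchical construction in the abstract can be written compactly. A minimal sketch of the standard generative specification, where H is a base measure, gamma and alpha_0 are concentration parameters, and F is the observation likelihood (these symbols are the usual ones, not given in this snippet):

    G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)
    G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0) \quad \text{for each group } j
    \theta_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})

Because a draw G_0 from a Dirichlet process is discrete with probability one, every child G_j reuses the atoms of G_0, which is exactly how mixture components come to be shared across groups.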
Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction
In Proceedings of NAACL-HLT, 2009
"... We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution. This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, prov ..."
Abstract

Cited by 47 (6 self)
 Add to MetaCart
We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution. This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar and providing a new way to encode prior knowledge about an unknown grammar. We describe a variational EM algorithm for learning a probabilistic grammar based on this family of priors. We then experiment with unsupervised dependency grammar induction and show significant improvements using our model for both monolingual learning and bilingual learning with a non-parallel, multilingual corpus.
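For context, a sketch of the plain logistic normal distribution that the partitioned and shared variants build on: a Gaussian vector is pushed through the softmax, so the covariance matrix couples the probabilities of different multinomial events (standard notation, not drawn from the snippet):

    \eta \sim \mathcal{N}(\mu, \Sigma), \qquad \theta_k = \frac{\exp(\eta_k)}{\sum_{k'} \exp(\eta_{k'})}

The shared variant described above extends this idea so that covariance can also be expressed across events belonging to different distributions of the grammar.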
Sampling alignment structure under a Bayesian translation model
In Empirical Methods in Natural Language Processing (EMNLP), 2008
"... We describe the first tractable Gibbs sampling procedure for estimating phrase pair frequencies under a probabilistic model of phrase alignment. We propose and evaluate two nonparametric priors that successfully avoid the degenerate behavior noted in previous work, where overly large phrases memoriz ..."
Abstract

Cited by 21 (3 self)
 Add to MetaCart
We describe the first tractable Gibbs sampling procedure for estimating phrase pair frequencies under a probabilistic model of phrase alignment. We propose and evaluate two nonparametric priors that successfully avoid the degenerate behavior noted in previous work, where overly large phrases memorize the training data. Phrase table weights learned under our model yield an increase in BLEU score over the word-alignment-based heuristic estimates used regularly in phrase-based translation systems.
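The abstract does not spell out its priors, so the following is only a generic sketch of the Chinese-restaurant-style posterior predictive that collapsed Gibbs samplers of this kind typically use to score candidate phrase pairs; the names dp_predictive, p0, and alpha are illustrative, not from the paper.

    def dp_predictive(pair, counts, total, alpha, p0):
        # Posterior predictive under a Dirichlet process prior over phrase pairs:
        # pairs already in use are favored in proportion to their counts, while
        # unseen pairs fall back to the base distribution p0 scaled by the
        # concentration alpha. `counts` excludes the pair currently being resampled.
        return (counts.get(pair, 0) + alpha * p0(pair)) / (total + alpha)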
Distance dependent Chinese restaurant processes
"... We develop the distance dependent Chinese restaurant process (CRP), a flexible class of distributions over partitions that allows for nonexchangeability. This class can be used to model dependencies between data in infinite clustering models, including dependencies across time or space. We examine t ..."
Abstract

Cited by 18 (3 self)
 Add to MetaCart
We develop the distance dependent Chinese restaurant process (CRP), a flexible class of distributions over partitions that allows for non-exchangeability. This class can be used to model dependencies between data in infinite clustering models, including dependencies across time or space. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both observed and mixture settings. We study its performance with time-dependent models and three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data. We also show that its alternative formulation of the traditional CRP leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation.
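A minimal sketch of the generative rule the abstract describes: each customer links to another customer with probability proportional to a decay function of their distance, or to itself with mass alpha, and the connected components of the resulting link graph form the clusters. The function names and the choice of decay are assumptions of the sketch, not taken from the paper.

    import numpy as np

    def ddcrp_sample_links(D, alpha, f, rng=np.random.default_rng(0)):
        # D: (n, n) matrix of pairwise distances; f: elementwise decay function;
        # alpha: self-link mass. Returns one sampled link per customer.
        n = D.shape[0]
        links = np.empty(n, dtype=int)
        for i in range(n):
            w = np.array(f(D[i]), dtype=float)  # affinity to every customer
            w[i] = alpha                        # self-link weight
            links[i] = rng.choice(n, p=w / w.sum())
        return links

    def links_to_clusters(links):
        # Customers in the same connected component of the link graph share a table.
        n = len(links)
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i, j in enumerate(links):
            parent[find(i)] = find(j)
        return [find(i) for i in range(n)]

    # Example usage: links = ddcrp_sample_links(D, alpha=1.0, f=lambda d: np.exp(-d))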
Indian Buffet Processes with Power-law Behavior
"... The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models. In this paper we propose a threeparameter generalization of the IBP exhibiting powerlaw behavior. We achieve this by generalizing the beta process (the de Finetti me ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models. In this paper we propose a three-parameter generalization of the IBP exhibiting power-law behavior. We achieve this by generalizing the beta process (the de Finetti measure of the IBP) to the stable-beta process and deriving the IBP corresponding to it. We find interesting relationships between the stable-beta process and the Pitman-Yor process (another stochastic process used in Bayesian nonparametric models with interesting power-law properties). We derive a stick-breaking construction for the stable-beta process, and find that our power-law IBP is a good model for word occurrences in document corpora.
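For reference, a sketch of the standard one-parameter Indian buffet process that the stable-beta construction generalizes (the generalization adds a stability exponent and a concentration parameter to obtain the power-law behavior described above); the function name and simulation framing are illustrative.

    import numpy as np

    def ibp_sample(n_customers, alpha, rng=np.random.default_rng(0)):
        # Standard one-parameter IBP: customer i takes an existing dish k with
        # probability m_k / i (m_k = how many earlier customers took it), then
        # samples Poisson(alpha / i) brand-new dishes.
        dishes = []      # per-dish counts m_k seen so far
        rows = []
        for i in range(1, n_customers + 1):
            row = [int(rng.random() < m / i) for m in dishes]
            new = rng.poisson(alpha / i)
            dishes = [m + z for m, z in zip(dishes, row)] + [1] * new
            rows.append(row + [1] * new)
        Z = np.zeros((n_customers, len(dishes)), dtype=int)
        for i, row in enumerate(rows):
            Z[i, :len(row)] = row
        return Z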
Lossless compression based on the Sequence Memoizer
In Data Compression Conference, 2010
"... In this work we describe a sequence compression method based on combining a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of PitmanYor processes of unbounded depth previously proposed by Wood et al. [2009] in the context of language modelling, allows modelling ..."
Abstract

Cited by 11 (4 self)
 Add to MetaCart
In this work we describe a sequence compression method based on combining a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al. [2009] in the context of language modelling, allows modelling of long-range dependencies by allowing conditioning contexts of unbounded length. We show that incremental approximate inference can be performed in this model, thereby allowing it to be used in a text compression setting. The resulting compressor reliably outperforms several PPM variants on many types of data, but is particularly effective in compressing data that exhibits power-law properties.
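A minimal sketch of the coupling described above: an entropy coder driven by an incremental predictive model emits close to -log2 p(symbol | context) bits per symbol, so the achievable compressed size is essentially the model's cumulative log-loss. The `predict` interface below is an assumption of the sketch, not the paper's API.

    import math

    def ideal_code_length_bits(sequence, predict):
        # Total ideal code length under an incremental predictive model.
        # predict(context) is assumed to return a dict mapping each possible
        # next symbol to its probability given the context seen so far.
        total, context = 0.0, []
        for s in sequence:
            total += -math.log2(predict(tuple(context))[s])
            context.append(s)
        return total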
A Stochastic Memoizer for Sequence Data
"... We propose an unboundeddepth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes ..."
Abstract

Cited by 11 (6 self)
 Add to MetaCart
We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.
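A sketch of the hierarchical Pitman-Yor predictive rule that such models build on (standard notation, not quoted from this snippet): c_{uw} is the count of symbol w in context u, t_{uw} the number of tables serving w, d and theta the discount and concentration at that depth, and sigma(u) the parent (shorter) context.

    P(w \mid u) = \frac{c_{uw} - d\, t_{uw}}{\theta + c_{u\cdot}} + \frac{\theta + d\, t_{u\cdot}}{\theta + c_{u\cdot}}\, P(w \mid \sigma(u))

Unused probability mass is recursively backed off to ever shorter contexts, which is what lets a single training sequence share statistical strength across all of its suffixes.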
Discriminative Clustering by Regularized Information Maximization
"... Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information ..."
Abstract

Cited by 11 (1 self)
 Add to MetaCart
Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance, and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes, and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method.
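A minimal sketch of the kind of objective the abstract describes, written for a plain (non-kernelized) softmax classifier; the paper's exact kernelized form and regularizer may differ, and all names below are illustrative.

    import numpy as np

    def rim_objective(W, b, X, lam):
        # Regularized information-maximization style objective: maximize an
        # estimate of I(x; y) minus a complexity penalty on the weights.
        # I(x; y) is estimated as H(mean prediction) - mean H(prediction):
        # the first term rewards balanced clusters, the second rewards
        # confident per-example predictions.
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # p(y | x_i), shape (N, K)
        eps = 1e-12
        cond_ent = -np.mean(np.sum(P * np.log(P + eps), axis=1))
        marg = P.mean(axis=0)
        marg_ent = -np.sum(marg * np.log(marg + eps))
        return (marg_ent - cond_ent) - lam * np.sum(W ** 2)

The three terms mirror the three ingredients named in the abstract: class balance, class separation, and classifier complexity.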
Productivity and Reuse in Language
2011
"... We present a Bayesian model of the mirror image problems of linguistic productivity and reuse. The model, known as Fragment Grammar, is evaluated against several morphological datasets; its performance is compared to competing theoretical accounts including full–parsing, full–listing, and exemplar–b ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
We present a Bayesian model of the mirror-image problems of linguistic productivity and reuse. The model, known as Fragment Grammar, is evaluated against several morphological datasets; its performance is compared to competing theoretical accounts, including full-parsing, full-listing, and exemplar-based models. The model is able to learn the correct patterns of productivity and reuse for two very different systems: the English past tense, which is characterized by a sharp dichotomy in productivity between regular and irregular forms, and English derivational morphology, which is characterized by a graded cline from very productive (-ness) to very unproductive (-th). Keywords: Productivity; Reuse; Storage; Computation; Bayesian Model; Past Tense; Derivational Morphology
An Unsupervised Model for Joint Phrase Alignment and Extraction
"... We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memori ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal but also by nonterminal symbols. This allows for a completely probabilistic model that is able to create a phrase table that achieves competitive accuracy on phrase-based machine translation tasks directly from unaligned sentence pairs. Experiments on several language pairs demonstrate that the proposed model matches the accuracy of the traditional two-step word alignment/phrase extraction approach while reducing the phrase table to a fraction of the original size.
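For context, the inversion transduction grammar machinery referred to above uses synchronous productions of the following form (standard ITG notation, not quoted from the paper):

    X \rightarrow [\,X\;X\,] \quad \text{(straight: children keep the same order in both languages)}
    X \rightarrow \langle X\;X \rangle \quad \text{(inverted: the second language reverses the order)}
    X \rightarrow e/f \quad \text{(terminal: an aligned source/target phrase pair)}

The abstract's contribution is to let the model memorize, and assign probability to, phrase pairs produced by nonterminal as well as terminal productions.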