Results 21–30 of 136
Unsupervised deduplication using cross-field dependencies
In KDD, Las Vegas, 2008
Abstract

Cited by 9 (1 self)
Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent (because venues tend to focus on a few research areas), but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a non-exchangeable string-edit model over venues. We show that modeling cross-field dependence yields a substantial improvement in performance: a 58% reduction in error over a standard Dirichlet process mixture.
Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce
Abstract

Cited by 9 (1 self)
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large-scale topic modeling package in MapReduce (Mr. LDA). As opposed to other techniques which use Gibbs sampling, our proposed framework uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions of the models possible with this scalable framework: informed priors to guide topic discovery and extracting topics from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large-scale topic modeling package. Mr. LDA outperforms Mahout both in execution speed and held-out likelihood.
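The per-document variational E-step that a MapReduce framework like Mr. LDA can distribute across mappers is illustrated below with standard LDA mean-field updates. This is a generic sketch assuming a fixed topic-word matrix `beta`, not Mr. LDA's actual implementation:

```python
import numpy as np
from scipy.special import digamma

def lda_e_step(word_ids, counts, alpha, beta, n_iter=50):
    """Mean-field variational E-step for a single document.

    word_ids: indices of the word types in the document.
    counts:   count of each word type.
    beta:     topic-word matrix of shape (K, V); rows sum to 1.
    Returns gamma (variational Dirichlet over topics) and phi
    (per-word-type topic responsibilities).
    """
    K = beta.shape[0]
    gamma = np.full(K, alpha + counts.sum() / K)
    for _ in range(n_iter):
        # phi_nk proportional to beta[k, w_n] * exp(digamma(gamma_k))
        log_phi = np.log(beta[:, word_ids].T + 1e-100) + digamma(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha + sum_n counts_n * phi_nk
        gamma = alpha + counts @ phi
    return gamma, phi
```

Because each document's (gamma, phi) update depends only on that document and the shared `beta`, the E-step parallelizes trivially: one document per map call, with the M-step re-estimating `beta` in the reducers.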
An Alternative Prior Process for Nonparametric Bayesian Clustering
Abstract

Cited by 8 (2 self)
Prior distributions play a crucial role in Bayesian approaches to clustering. Two commonly used prior distributions are the Dirichlet and Pitman-Yor processes. In this paper, we investigate the predictive probabilities that underlie these processes, and the implicit “rich-get-richer” characteristic of the resulting partitions. We explore an alternative prior for nonparametric Bayesian clustering, the uniform process, for applications where the “rich-get-richer” property is undesirable. We also explore the cost of this process: partitions are no longer exchangeable with respect to the ordering of variables. We present new asymptotic and simulation-based results for the clustering characteristics of the uniform process and compare these with known results for the Dirichlet and Pitman-Yor processes. We compare performance on a real document clustering task, demonstrating the practical advantage of the uniform process despite its lack of exchangeability over orderings.
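The contrast between the two predictive rules is easy to state concretely. A minimal sketch, using the standard Chinese-restaurant rule for the Dirichlet process and a per-cluster-uniform rule for the alternative prior (the parameter names are illustrative):

```python
def dp_predictive(cluster_sizes, alpha):
    """Dirichlet process: join existing cluster k with probability
    n_k / (n + alpha); open a new cluster with probability
    alpha / (n + alpha). Larger clusters attract more items:
    the 'rich-get-richer' effect."""
    n = sum(cluster_sizes)
    probs = [nk / (n + alpha) for nk in cluster_sizes]
    probs.append(alpha / (n + alpha))
    return probs

def uniform_predictive(cluster_sizes, theta):
    """Uniform process: every existing cluster gets the same
    probability 1 / (K + theta) regardless of its size; a new
    cluster gets theta / (K + theta), where K is the current
    number of clusters."""
    K = len(cluster_sizes)
    probs = [1.0 / (K + theta) for _ in cluster_sizes]
    probs.append(theta / (K + theta))
    return probs
```

With cluster sizes [10, 1], the DP rule strongly favors the large cluster, while the uniform rule assigns both clusters equal probability. The price, as the abstract notes, is that the uniform process is not exchangeable over item orderings.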
Sampling table configurations for the hierarchical Poisson-Dirichlet process
In ECML, 2011
Abstract

Cited by 8 (5 self)
Hierarchical modeling and reasoning are fundamental in machine intelligence, and for this the two-parameter Poisson-Dirichlet Process (PDP) plays an important role. The most popular MCMC sampling algorithm for the hierarchical PDP and hierarchical Dirichlet Process is to conduct an incremental sampling based on the Chinese restaurant metaphor, which originates from the Chinese restaurant process (CRP). In this paper, with the same metaphor, we propose a new table representation for hierarchical PDPs by introducing an auxiliary latent variable, called the table indicator, to record which customer takes responsibility for starting a new table. In this way, the new representation allows full exchangeability, an essential condition for a correct Gibbs sampling algorithm. Based on this representation, we develop a block Gibbs sampling algorithm, which can jointly sample the data item and its table contribution. We test this out on the hierarchical Dirichlet process variant of latent Dirichlet allocation (HDP-LDA) developed by Teh, Jordan, Beal and Blei. Experiment results show that the proposed algorithm outperforms their “posterior sampling by direct assignment” algorithm in both out-of-sample perplexity and convergence speed. The representation can be used with many other hierarchical PDP models.
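The Chinese restaurant metaphor for the two-parameter PDP assigns a customer to tables as follows; this is a sketch of the standard seating probabilities only, not of the table-indicator representation the paper proposes:

```python
def pdp_table_probs(table_counts, a, b):
    """Seating rule for the Chinese restaurant view of the
    two-parameter Poisson-Dirichlet process with discount a in
    [0, 1) and concentration b:
      existing table k:  (n_k - a) / (n + b)
      new table:         (b + a * K) / (n + b)
    where n is the total customer count and K the number of
    occupied tables. Setting a = 0 recovers the ordinary
    Dirichlet-process CRP."""
    n = sum(table_counts)
    K = len(table_counts)
    probs = [(nk - a) / (n + b) for nk in table_counts]
    probs.append((b + a * K) / (n + b))
    return probs
```

The discount `a` transfers probability mass from occupied tables to the new-table option in proportion to K, which is what gives the PDP its power-law table growth.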
A Bayesian Review of the Poisson-Dirichlet Process
, 2010
Abstract

Cited by 7 (5 self)
The two-parameter Poisson-Dirichlet process, also known as the Pitman-Yor process and related to the Chinese restaurant process, is a generalisation of the Dirichlet process, and is increasingly being used for probabilistic modelling in discrete areas such as language and images. This article reviews the theory of the Poisson-Dirichlet process in terms of its consistency for estimation, its convergence rates, and the posteriors of data. This theory has been well developed for continuous distributions (more generally referred to as non-atomic distributions). This article then presents a Bayesian interpretation of the Poisson-Dirichlet process: it is a mixture using an improper and infinite-dimensional Dirichlet distribution. This interpretation requires technicalities of priors, posteriors and Hilbert spaces, but conceptually it means we can understand the process as just another Dirichlet, and thus all its sampling properties fit naturally. Finally, this article also presents results for the discrete case, which is now seeing widespread use in computer science but has received less attention in the literature.
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Google, 2013
Abstract

Cited by 7 (1 self)
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6. A combination of techniques leads to a 35% reduction in perplexity, or a 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
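The two headline numbers are mutually consistent, since cross-entropy in bits per word is the base-2 logarithm of perplexity. A quick arithmetic check:

```python
import math

baseline_ppl = 67.6                    # unpruned Kneser-Ney 5-gram
combined_ppl = 0.65 * baseline_ppl     # a 35% reduction in perplexity

# Cross-entropy (bits/word) is log2 of perplexity.
baseline_bits = math.log2(baseline_ppl)   # about 6.08 bits/word
combined_bits = math.log2(combined_ppl)   # about 5.46 bits/word
relative_bits_saved = 1 - combined_bits / baseline_bits  # about 10%
```

A 35% perplexity reduction therefore corresponds to roughly a 10% saving in bits, exactly as the abstract states; the two figures describe the same improvement on different scales.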
A Bayesian Model for Learning SCFGs with Discontiguous Rules
Abstract

Cited by 7 (1 self)
We describe a nonparametric model and corresponding inference algorithm for learning Synchronous Context Free Grammar derivations for parallel text. The model employs a Pitman-Yor Process prior which uses a novel base distribution over synchronous grammar rules. Through both synthetic grammar induction and statistical machine translation experiments, we show that our model learns complex translational correspondences, including discontiguous, many-to-many alignments, and produces competitive translation results. Further, inference is efficient and we present results on significantly larger corpora than prior work.
Dependent Hierarchical Normalized Random Measures for Dynamic Topic Modeling
Abstract

Cited by 7 (2 self)
We develop dependent hierarchical normalized random measures and apply them to dynamic topic modeling. The dependency arises via superposition, subsampling and point transition on the underlying Poisson processes of these measures. The measures used include normalised generalised Gamma processes that demonstrate power-law properties, unlike Dirichlet processes used previously in dynamic topic modeling. Inference for the model includes adapting a recently developed slice sampler to directly manipulate the underlying Poisson process. Experiments performed on news, blog, academic and Twitter collections demonstrate the technique gives superior perplexity over a number of previous models.
Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
Abstract

Cited by 6 (1 self)
We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50–100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.
Pitman-Yor Process-Based Language Models for Machine Translation
Abstract

Cited by 6 (3 self)
The hierarchical Pitman-Yor process-based smoothing method applied to language models was proposed by Goldwater and by Teh; the performance of this smoothing method has been shown to be comparable with the modified Kneser-Ney method in terms of perplexity. Although this method was presented four years ago, no paper has reported that this language model indeed improves translation quality in the context of Machine Translation (MT). This is important for the MT community since an improvement in perplexity does not always lead to an improvement in BLEU score; for example, success in word alignment measured by Alignment Error Rate (AER) does not often lead to an improvement in BLEU. This paper reports, in the context of MT, that an improvement in perplexity really does lead to an improvement in BLEU score. It turned out that an application of the Hierarchical Pitman-Yor Language Model (HPYLM) requires a minor change in the conventional decoding process. In addition, we propose a new Pitman-Yor process-based statistical smoothing method similar to the Good-Turing method, although its performance is inferior to HPYLM. We conducted experiments; HPYLM improved by 1.03 BLEU points absolute and 6% relative for 50k EN-JP, which was statistically significant.
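The smoothing the abstract refers to is the hierarchical Pitman-Yor predictive probability, which interpolates each context's restaurant with its shortened-context parent. The sketch below is a generic single-path version of that recursion under assumed restaurant statistics (`counts` holding customers and `tables` holding table counts per word), not the authors' decoder integration:

```python
def hpylm_prob(w, context, counts, tables, d, theta, base_vocab):
    """One level of the hierarchical Pitman-Yor predictive rule:
      P(w|u) = (c_uw - d*t_uw) / (theta + c_u)
             + (theta + d*t_u) / (theta + c_u) * P(w|shorter context)
    where c are customer counts, t are table counts, d is the
    discount and theta the concentration. The recursion bottoms
    out at the uniform distribution 1/V below the empty context."""
    if context is None:                       # below the empty context
        return 1.0 / base_vocab
    c = counts.get(context, {})
    t = tables.get(context, {})
    c_u = sum(c.values())
    t_u = sum(t.values())
    shorter = context[1:] if len(context) > 0 else None
    parent = hpylm_prob(w, shorter, counts, tables, d, theta, base_vocab)
    if c_u == 0:                              # empty restaurant: back off
        return parent
    num = max(c.get(w, 0) - d * t.get(w, 0), 0.0)
    return num / (theta + c_u) + (theta + d * t_u) / (theta + c_u) * parent
```

Discounted counts at each level are redistributed to the parent distribution, which is what makes the scheme behave like an interpolated Kneser-Ney smoother.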