Results 1–10 of 33
Learning Document-Level Semantic Properties from Free-Text Annotations
Abstract

Cited by 31 (4 self)
This paper demonstrates a new method for leveraging unstructured annotations to infer semantic document properties. We consider the domain of product reviews, which are often annotated by their authors with free-text keyphrases, such as “a real bargain” or “good value.” We leverage these unstructured annotations by clustering them into semantic properties, and then tying the induced clusters to hidden topics in the document text. This allows us to predict relevant properties of unannotated documents. Our approach is implemented in a hierarchical Bayesian model with joint inference, which increases the robustness of the keyphrase clustering and encourages document topics to correlate with semantically meaningful properties. We perform several evaluations of our model, and find that it substantially outperforms alternative approaches.
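As an illustration of the keyphrase-clustering step only: the paper performs this clustering jointly with topic inference in a hierarchical Bayesian model, whereas the greedy threshold-based procedure and Jaccard word-overlap similarity below are stand-in assumptions for a minimal sketch.

```python
# Greedy single-pass clustering of free-text keyphrases by word-overlap
# (Jaccard) similarity.  Illustrative only: the paper induces clusters
# jointly with document topics rather than with a fixed threshold.
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cluster_keyphrases(phrases, threshold=0.3):
    clusters = []
    for p in phrases:
        for c in clusters:
            # Join the first cluster containing a sufficiently similar phrase.
            if max(jaccard(p, q) for q in c) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])  # no match: start a new cluster
    return clusters

phrases = ["a real bargain", "good value", "great value", "poor quality"]
print(cluster_keyphrases(phrases))
```

Even this crude surface-similarity grouping shows why clustering helps: “good value” and “great value” land in one induced property while unrelated phrases stay apart.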
Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
Abstract

Cited by 26 (7 self)
Supervised sequence-labeling systems in natural language processing often suffer from data sparsity because they use word types as features in their prediction tasks. Consequently, they have difficulty estimating parameters for types which appear in the test set, but seldom (or never) appear in the training set. We demonstrate that distributional representations of word types, trained on unannotated text, can be used to improve performance on rare words. We incorporate aspects of these representations into the feature space of our sequence-labeling systems. In an experiment on a standard chunking dataset, our best technique improves a chunker from 0.76 F1 to 0.86 F1 on chunks beginning with rare words. On the same dataset, it improves our part-of-speech tagger from 74% to 80% accuracy on rare words. Furthermore, our system improves significantly over a baseline system when applied to text from a different domain, and it reduces the sample complexity of sequence labeling.
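A minimal sketch of one way to build such distributional representations from unannotated text (the one-word co-occurrence window, the truncated SVD, and the toy corpus are illustrative assumptions; the paper evaluates several representation variants):

```python
import numpy as np

# Toy corpus; in practice this would be a large unannotated corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a dog".split(),
]

# Word-by-context co-occurrence counts with a window of 1 on each side.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                counts[idx[w], idx[sent[j]]] += 1

# Truncated SVD gives each word type a dense low-dimensional vector; such
# vectors can be appended to a sequence labeler's feature space so that
# rare words share statistical strength with distributionally similar
# frequent words.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 3
embeddings = U[:, :k] * S[:k]

print(embeddings[idx["cat"]])  # dense feature vector for "cat"
```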
Posterior vs. Parameter Sparsity in Latent Variable Models: Supplementary Material
Abstract

Cited by 20 (0 self)
1.1 Derivation of the ℓ1/ℓ∞ dual program

We want to optimize the objective:

    min_{q, c_wt}  KL(q‖p) + σ Σ_{wt} c_wt    s.t.  E_q[f_wti] ≤ c_wt,  0 ≤ c_wt    (1)

The Lagrangian becomes:

    L(q, c, α, λ) = KL(q‖p) + σ Σ_{wt} c_wt + Σ_{wti} λ_wti (E_q[f_wti] − c_wt) − α · c    (2)

where we are maximizing with respect to λ ≥ 0 and α ≥ 0. Taking the derivative with respect to q(z) we have:

    ∂L(q, c, α, λ)/∂q(z) = log q(z) + 1 − log p(z) − f(z) · λ    (3)
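A sketch of the step that typically follows, using the symbols above (Z_λ is a normalizer introduced here): setting (3) to zero gives log q(z) = log p(z) + f(z) · λ − 1, and absorbing the constant into Z_λ yields the familiar exponential-family form of the optimal distribution,

    q_λ(z) = p(z) exp(f(z) · λ) / Z_λ,    Z_λ = Σ_z p(z) exp(f(z) · λ).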
Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches
Abstract

Cited by 17 (6 self)
We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsupervised part-of-speech tagging: a model that directly merges tag structures for a pair of languages into a single sequence, and a second model which instead incorporates multilingual context using latent variables. Both approaches are formulated as hierarchical Bayesian models, using Markov chain Monte Carlo sampling techniques for inference. Our results demonstrate that by incorporating multilingual evidence we can achieve impressive performance gains across a range of scenarios. We also found that performance improves steadily as the number of available languages increases.
Two Decades of Unsupervised POS induction: How far have we come?
Abstract

Cited by 17 (1 self)
Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on an evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.
Unsupervised Multilingual Learning for POS Tagging
Abstract

Cited by 13 (7 self)
We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The key hypothesis of multilingual learning is that by combining cues from multiple languages, the structure of each becomes more apparent. We formulate a hierarchical Bayesian model for jointly predicting bilingual streams of part-of-speech tags. The model learns language-specific features while capturing cross-lingual patterns in tag distribution for aligned words. Once the parameters of our model have been learned on bilingual parallel data, we evaluate its performance on a held-out monolingual test set. Our evaluation on six pairs of languages shows consistent and significant performance gains over a state-of-the-art monolingual baseline. For one language pair, we observe a relative reduction in error of 53%.
EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)
In Proc. ACL, 2008
Abstract

Cited by 12 (2 self)
We address the task of unsupervised POS tagging. We demonstrate that good results can be obtained using the robust EM-HMM learner when provided with good initial conditions, even with incomplete dictionaries. We present a family of algorithms to compute effective initial estimations p(t|w). We test the method on the task of full morphological disambiguation in Hebrew, achieving an error reduction of 25% over a strong uniform distribution baseline. We also test the same method on the standard WSJ unsupervised POS tagging task and obtain results competitive with recent state-of-the-art methods, while using simple and efficient learning methods.
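The family of initializers produces estimates p(t|w) before EM is run. A minimal sketch of the simplest such estimator (the toy dictionary, tag set, and uniform back-off below are illustrative assumptions, not the paper's actual algorithms):

```python
# A toy, deliberately incomplete tag dictionary: word -> licensed tags.
tag_dict = {
    "the": ["DET"],
    "dog": ["NOUN"],
    "runs": ["VERB", "NOUN"],
}
tags = ["DET", "NOUN", "VERB"]

def initial_p_t_given_w(word, tag_dict, tags):
    """Uniform distribution over a word's dictionary tags, backing off to
    uniform over all tags for words missing from the dictionary."""
    allowed = tag_dict.get(word, tags)
    return {t: (1.0 / len(allowed) if t in allowed else 0.0) for t in tags}

print(initial_p_t_given_w("runs", tag_dict, tags))
# Emission probabilities p(w|t) can then be seeded from these estimates
# (via Bayes' rule and word frequencies) before EM refines them.
```

The point of the abstract is that even a crude distribution of this kind, rather than a uniform start, is enough to steer EM toward a good basin.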
Covariance in Unsupervised Learning of Probabilistic Grammars
Abstract

Cited by 11 (5 self)
Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data.
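How a logistic normal prior models covariance can be sketched directly: draw a Gaussian vector with a full covariance matrix, then push it through the softmax. This is a sketch of the distribution itself, not of the paper's inference algorithm; the dimension, mean, and covariance values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Off-diagonal covariance lets related parameters (e.g. grammar rules
# sharing linguistic properties) rise and fall together, which an
# independent-component Dirichlet prior cannot express.
K = 4
mu = np.zeros(K)
cov = np.array([
    [1.0, 0.8, 0.0, 0.0],   # outcomes 0 and 1 positively correlated
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

def sample_logistic_normal(mu, cov, rng):
    eta = rng.multivariate_normal(mu, cov)  # Gaussian in R^K
    exp_eta = np.exp(eta - eta.max())       # numerically stable softmax
    return exp_eta / exp_eta.sum()          # point on the simplex

theta = sample_logistic_normal(mu, cov, rng)
print(theta, theta.sum())
```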
Variational Inference for Adaptor Grammars
Abstract

Cited by 9 (1 self)
Adaptor grammars extend probabilistic context-free grammars to define prior distributions over trees with “rich get richer” dynamics. Inference for adaptor grammars seeks to find parse trees for raw text. This paper describes a variational inference algorithm for adaptor grammars, providing an alternative to Markov chain Monte Carlo methods. To derive this method, we develop a stick-breaking representation of adaptor grammars, a representation that enables us to define adaptor grammars with recursion. We report experimental results on a word segmentation task, showing that variational inference performs comparably to MCMC. Further, we show a significant speedup when parallelizing the algorithm. Finally, we report promising results for a new application of adaptor grammars, dependency grammar induction.
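The stick-breaking construction underlying such representations can be sketched as follows, assuming the Dirichlet-process special case with Beta(1, α) sticks (adaptor grammars more generally adapt Pitman–Yor processes; the truncation level and α value here are illustrative). Truncating the stick at a fixed length is what makes a variational, rather than sampling-based, treatment tractable.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, truncation, rng):
    """Truncated stick-breaking weights for a Dirichlet process.

    Draw beta_k ~ Beta(1, alpha) and set
        pi_k = beta_k * prod_{j<k} (1 - beta_j),
    i.e. each beta_k breaks off a fraction of the remaining stick.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    pi = betas * remaining
    pi[-1] = 1.0 - pi[:-1].sum()   # fold leftover mass into the last stick
    return pi

pi = stick_breaking(alpha=2.0, truncation=20, rng=rng)
print(pi.sum())  # the truncated weights sum to 1
```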
The Shared Logistic Normal Distribution for Grammar Induction
In Proc. NAACL, 2009
Abstract

Cited by 5 (0 self)
We present a shared logistic normal distribution as a Bayesian prior over probabilistic grammar weights. This approach generalizes the similar use of logistic normal distributions [3], enabling soft parameter tying during inference across different multinomials comprising the probabilistic grammar. We show that this model outperforms previous approaches on an unsupervised dependency grammar induction task.