Results 1-10 of 138
An Introduction to Conditional Random Fields
Foundations and Trends in Machine Learning, 2012
Using Universal Linguistic Knowledge to Guide Grammar Induction
Cited by 57 (7 self)
We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-the-art unsupervised methods by a significant margin.
Two Decades of Unsupervised POS induction: How far have we come?
Cited by 40 (3 self)
Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.
Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Cited by 32 (6 self)
We describe a method for prediction of linguistic structure in a language for which only unlabeled data is available, using annotated data from a set of one or more helper languages. Our approach is based on a model that locally mixes between supervised models from the helper languages. Parallel data is not used, allowing the technique to be applied even in domains where human-translated texts are unavailable. We obtain state-of-the-art performance for two tasks of structure prediction: unsupervised part-of-speech tagging and unsupervised dependency parsing.
A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
Cited by 25 (3 self)
In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model. We develop a novel hidden Markov model incorporating sophisticated smoothing using a hierarchical Pitman-Yor processes prior, providing an elegant and principled means of incorporating lexical characteristics. Central to our approach is a new type-based sampling algorithm for hierarchical Pitman-Yor models in which we track fractional table counts. In an empirical evaluation we show that our model consistently outperforms the current state-of-the-art across 10 languages.
Sparsity in Dependency Grammar Induction
Cited by 25 (0 self)
A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
Infinite Latent SVM for Classification and Multitask Learning
Cited by 21 (12 self)
Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes' theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multitask infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multitask learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.
Learning continuous phrase representations for translation modeling
In ACL, 2014
Cited by 20 (5 self)
This paper tackles the sparsity problem in estimating phrase translation probabilities by learning continuous phrase representations, whose distributed nature enables the sharing of related phrases in their representations. A pair of source and target phrases are projected into continuous-valued vector representations in a low-dimensional latent space, where their translation score is computed by the distance between the pair in this new space. The projection is performed by a neural network whose weights are learned on parallel training data. Experimental evaluation has been performed on two WMT translation tasks. Our best result improves the performance of a state-of-the-art phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 1.3 BLEU points.
Learning Discriminative Projections for Text Similarity Measures
Cited by 19 (5 self)
Traditional text similarity measures consider each term similar only to itself and do not model semantic relatedness of terms. We propose a novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space. Our approach operates by finding the optimal matrix to minimize the loss of the preselected similarity function (e.g., cosine) of the projected vectors, and is able to efficiently handle a large number of training examples in the high-dimensional space. Evaluated on two very different tasks, cross-lingual document retrieval and ad relevance measure, our method not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions and is thus more efficient.
The Latent Maximum Entropy Principle
In Proc. of ISIT, 2002
Cited by 19 (5 self)
We present an extension to Jaynes' maximum entropy principle that handles latent variables. The principle of latent maximum entropy we propose is different from both Jaynes' maximum entropy principle and maximum likelihood estimation, but often yields better estimates in the presence of hidden variables and limited training data. We first show that solving for a latent maximum entropy model poses a hard nonlinear constrained optimization problem in general. However, we then show that feasible solutions to this problem can be obtained efficiently for the special case of log-linear models, which forms the basis for an efficient approximation to the latent maximum entropy principle. We derive an algorithm that combines expectation-maximization with iterative scaling to produce feasible log-linear solutions. This algorithm can be interpreted as an alternating minimization algorithm in the information divergence, and reveals an intimate connection between the latent maximum entropy and maximum likelihood principles.