Results 1-10 of 118
An Introduction to Conditional Random Fields
Foundations and Trends in Machine Learning, 2012
Using Universal Linguistic Knowledge to Guide Grammar Induction
Cited by 52 (7 self)
We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-the-art unsupervised methods by a significant margin.
Two Decades of Unsupervised POS induction: How far have we come?
Cited by 35 (3 self)
Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.
Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Cited by 29 (5 self)
We describe a method for prediction of linguistic structure in a language for which only unlabeled data is available, using annotated data from a set of one or more helper languages. Our approach is based on a model that locally mixes between supervised models from the helper languages. Parallel data is not used, allowing the technique to be applied even in domains where human-translated texts are unavailable. We obtain state-of-the-art performance for two tasks of structure prediction: unsupervised part-of-speech tagging and unsupervised dependency parsing.
Approximate Inference in Additive Factorial HMMs with Application to Energy Disaggregation
Cited by 26 (1 self)
This paper considers additive factorial hidden Markov models, an extension to HMMs where the state factors into multiple independent chains, and the output is an additive function of all the hidden states. Although such models are very powerful, accurate inference is unfortunately difficult: exact inference is not computationally tractable, and existing approximate inference techniques are highly susceptible to local optima. In this paper we propose an alternative inference method for such models, which exploits their additive structure by 1) looking at the observed difference signal of the observation, 2) incorporating a “robust” mixture component that can account for unmodeled observations, and 3) constraining the posterior to allow at most one hidden state to change at a time. Combining these elements we develop a convex formulation of approximate inference that is computationally efficient, has no issues of local optima, and which performs much better than existing approaches in practice. The method is motivated by the problem of energy disaggregation, the task of taking a whole-home electricity signal and decomposing it into its component appliances; applied to this task, our algorithm achieves state-of-the-art performance, and is able to separate many appliances almost perfectly using just the total aggregate signal.
Sparsity in Dependency Grammar Induction
Cited by 22 (0 self)
A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
Cited by 20 (2 self)
In this work we address the problem of unsupervised part-of-speech induction by bringing together several strands of research into a single model. We develop a novel hidden Markov model incorporating sophisticated smoothing using a hierarchical Pitman-Yor process prior, providing an elegant and principled means of incorporating lexical characteristics. Central to our approach is a new type-based sampling algorithm for hierarchical Pitman-Yor models in which we track fractional table counts. In an empirical evaluation we show that our model consistently outperforms the current state-of-the-art across 10 languages.
Infinite Latent SVM for Classification and Multitask Learning
Cited by 19 (12 self)
Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes' theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multitask infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multitask learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.
The Latent Maximum Entropy Principle
In Proc. of ISIT, 2002
Cited by 17 (3 self)
We present an extension to Jaynes' maximum entropy principle that handles latent variables. The principle of latent maximum entropy we propose is different from both Jaynes' maximum entropy principle and maximum likelihood estimation, but often yields better estimates in the presence of hidden variables and limited training data. We first show that solving for a latent maximum entropy model poses a hard nonlinear constrained optimization problem in general. However, we then show that feasible solutions to this problem can be obtained efficiently for the special case of log-linear models, which forms the basis for an efficient approximation to the latent maximum entropy principle. We derive an algorithm that combines expectation-maximization with iterative scaling to produce feasible log-linear solutions. This algorithm can be interpreted as an alternating minimization algorithm in the information divergence, and reveals an intimate connection between the latent maximum entropy and maximum likelihood principles.
Learning Discriminative Projections for Text Similarity Measures
Cited by 17 (4 self)
Traditional text similarity measures consider each term similar only to itself and do not model semantic relatedness of terms. We propose a novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space. Our approach operates by finding the optimal matrix to minimize the loss of the pre-selected similarity function (e.g., cosine) of the projected vectors, and is able to efficiently handle a large number of training examples in the high-dimensional space. Evaluated on two very different tasks, cross-lingual document retrieval and ad relevance measure, our method not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions and is thus more efficient.