Results 1  10
of
32
Posterior Regularization for Structured Latent Variable Models
"... We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model co ..."
Abstract

Cited by 118 (8 self)
 Add to MetaCart
(Show Context)
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multiview learning, crosslingual dependency grammar induction, unsupervised partofspeech induction, and bitext word alignment. 1
Unsupervised PartofSpeech Tagging with Bilingual GraphBased Projections
"... We describe a novel approach for inducing unsupervised partofspeech taggers for languages that have no labeled training data, but have translated text in a resourcerich language. Our method does not assume any knowledge about the target language (in particular no tagging dictionary is assumed), m ..."
Abstract

Cited by 56 (6 self)
 Add to MetaCart
(Show Context)
We describe a novel approach for inducing unsupervised partofspeech taggers for languages that have no labeled training data, but have translated text in a resourcerich language. Our method does not assume any knowledge about the target language (in particular no tagging dictionary is assumed), making it applicable to a wide array of resourcepoor languages. We use graphbased label propagation for crosslingual knowledge transfer and use the projected labels as features in an unsupervised model (BergKirkpatrick et al., 2010). Across eight European languages, our approach results in an average absolute improvement of 10.4 % over a stateoftheart baseline, and 16.7 % over vanilla hidden Markov models induced with the Expectation Maximization algorithm. 1
Posterior vs. Parameter Sparsity in Latent Variable Models Supplementary Material
"... 1.1 Derivation of the ℓ1/ℓ ∞ dual program We want to optimize the objective: The Lagrangian becomes: min q,cwt KL(qp) + σ ∑ s. t. Eq[fwti] ≤ cwt 0 ≤ cwt L(q, c, α, λ) = KL(qp) + σ ∑ cwt + ∑ λwti(Eq[fwti] − cwt) − α · c (2) wt where we are maximizing with respect to λ ≥ 0 and α ≥ 0. Taking th ..."
Abstract

Cited by 26 (0 self)
 Add to MetaCart
(Show Context)
1.1 Derivation of the ℓ1/ℓ ∞ dual program We want to optimize the objective: The Lagrangian becomes: min q,cwt KL(qp) + σ ∑ s. t. Eq[fwti] ≤ cwt 0 ≤ cwt L(q, c, α, λ) = KL(qp) + σ ∑ cwt + ∑ λwti(Eq[fwti] − cwt) − α · c (2) wt where we are maximizing with respect to λ ≥ 0 and α ≥ 0. Taking the derivative with respect to q(z) we have: ∂L(q, c, α, λ) = log q(z) + 1 − log p(z) − f(z) · λ (3)
Multilingual PartofSpeech Tagging: Two Unsupervised Approaches
"... We demonstrate the effectiveness of multilingual learning for unsupervised partofspeech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsu ..."
Abstract

Cited by 25 (7 self)
 Add to MetaCart
(Show Context)
We demonstrate the effectiveness of multilingual learning for unsupervised partofspeech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsupervised partofspeech tagging: a model that directly merges tag structures for a pair of languages into a single sequence and a second model which instead incorporates multilingual context using latent variables. Both approaches are formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo sampling techniques for inference. Our results demonstrate that by incorporating multilingual evidence we can achieve impressive performance gains across a range of scenarios. We also found that performance improves steadily as the number of available languages increases. 1.
Wikily Supervised PartofSpeech Tagging
"... Despite significant recent work, purely unsupervised techniques for partofspeech (POS) tagging have not achieved useful accuracies required by many language processing tasks. Use of parallel text between resourcerich and resourcepoor languages is one source of weak supervision that significantly ..."
Abstract

Cited by 22 (0 self)
 Add to MetaCart
(Show Context)
Despite significant recent work, purely unsupervised techniques for partofspeech (POS) tagging have not achieved useful accuracies required by many language processing tasks. Use of parallel text between resourcerich and resourcepoor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POStaggers exceeding stateoftheart bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds best unsupervised and parallel text methods. We achieve highest accuracy reported for several languages and show that our approach yields better outofdomain taggers than those trained using fully supervised Penn Treebank. 1
Exploiting Spatial Context Constraints for Automatic Image Region Annotation
, 2007
"... In this paper we conduct a relatively complete study on how to exploit spatial context constraints for automated image region annotation. We present a straightforward method to regularize the segmented regions into 2D lattice layout, so that simple gridstructure graphical models can be employed to ..."
Abstract

Cited by 19 (1 self)
 Add to MetaCart
(Show Context)
In this paper we conduct a relatively complete study on how to exploit spatial context constraints for automated image region annotation. We present a straightforward method to regularize the segmented regions into 2D lattice layout, so that simple gridstructure graphical models can be employed to characterize the spatial dependencies. We show how to represent the spatial context constraints in various graphical models and also present the related learning and inference algorithms. Different from most of the existing work, we specifically investigate how to combine the classification performance of discriminative learning and the representation capability of graphical models. To reliably evaluate the proposed approaches, we create a moderate scale image set with regionlevel ground truth. The experimental results show that (i) spatial context constraints indeed help for accurate region annotation, (ii) the approaches combining the merits of discriminative learning and context constraints perform best, (iii) image retrieval can benefit from accurate regionlevel annotation.
Covariance in Unsupervised Learning of Probabilistic Grammars
"... Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of wellunderstood, generalpurpose learn ..."
Abstract

Cited by 12 (5 self)
 Add to MetaCart
(Show Context)
Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of wellunderstood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, nonparallel data.
MessagePassing for Approximate MAP Inference with Latent Variables
"... We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid messagepassing algorithm to accomplish this. The hybr ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
(Show Context)
We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid messagepassing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and realworld datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. 1
The shared logistic normal distribution for grammar induction
 In Proc. NAACL
, 2009
"... We present a shared logistic normal distribution as a Bayesian prior over probabilistic grammar weights. This approach generalizes the similar use of logistic normal distributions [3], enabling soft parameter tying during inference across different multinomials comprising the probabilistic grammar. ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
(Show Context)
We present a shared logistic normal distribution as a Bayesian prior over probabilistic grammar weights. This approach generalizes the similar use of logistic normal distributions [3], enabling soft parameter tying during inference across different multinomials comprising the probabilistic grammar. We show that this model outperforms previous approaches on an unsupervised dependency grammar induction task. 1
Learning Representations for Weakly Supervised Natural Language Processing Tasks
"... Information Sciences Finding the right representations for words is critical for building accurate NLP systems when domainspecific labeled data for the task is scarce. This paper investigates novel techniques for extracting features from ngram models, Hidden Markov Models, and other statistical la ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
Information Sciences Finding the right representations for words is critical for building accurate NLP systems when domainspecific labeled data for the task is scarce. This paper investigates novel techniques for extracting features from ngram models, Hidden Markov Models, and other statistical language models, including a novel Partial Lattice Markov Random Field model. Experiments on partofspeech tagging and information extraction, among other tasks, indicate that features taken from statistical language models, in combination with more traditional features, outperform traditional representations alone, and that graphical model representations outperform ngram models, especially on sparse and polysemous words. 1.