Results 1-10 of 52
Posterior Regularization for Structured Latent Variable Models
Cited by 74 (7 self)
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large-scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.
Generalized expectation criteria for semi-supervised learning of conditional random fields
In Proc. ACL, pages 870-878, 2008
Cited by 69 (8 self)
This paper presents a semi-supervised training method for linear-chain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using generalized expectation criteria to express a preference for parameter settings in which the model’s distribution on unlabeled data matches a target distribution. We induce target conditional probability distributions of labels given features from both annotated feature occurrences in context and ad hoc feature majority label assignment. The use of generalized expectation criteria allows for a dramatic reduction in annotation time by shifting from traditional instance-labeling to feature-labeling, and the methods presented outperform traditional CRF training and other semi-supervised methods when limited human effort is available.
Expectation maximization and posterior constraints
In Advances in NIPS, 2007
Cited by 54 (11 self)
The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data, and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models.
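As an illustration of what a posterior constraint can look like in practice, the sketch below KL-projects a discrete posterior p onto the set of distributions q with E_q[f] <= b, for a single scalar feature f. It is a minimal sketch, not the authors' implementation: the function name `project_posterior`, the bisection search over the dual variable, and the assumptions that p is strictly positive and that the constraint is feasible (min f < b) are all choices made here for illustration.

```python
import numpy as np

def project_posterior(p, f, b, lam_max=1e3, tol=1e-10):
    """KL-project p onto {q : E_q[f] <= b} for one scalar feature f.

    Assumes p > 0 everywhere and min(f) < b (hypothetical helper).
    Returns the projected distribution q and the dual variable lam >= 0.
    """
    p = np.asarray(p, dtype=float)
    f = np.asarray(f, dtype=float)

    def tilt(lam):
        # q(z) is proportional to p(z) * exp(-lam * f(z)), computed in log space.
        logits = np.log(p) - lam * f
        logits -= logits.max()
        q = np.exp(logits)
        return q / q.sum()

    # If the (normalized) posterior already satisfies the constraint, keep it.
    if tilt(0.0) @ f <= b:
        return tilt(0.0), 0.0

    # E_q[f] decreases monotonically in lam, so bisection finds the boundary.
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tilt(mid) @ f > b:
            lo = mid
        else:
            hi = mid
    return tilt(hi), hi
```

In an EM loop, a projection of this kind would stand in for the ordinary posterior in the E-step, while the M-step stays unchanged; with several constraints the scalar search would be replaced by a multivariate dual optimization.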
Dependency grammar induction via bitext projection constraints
In ACL-IJCNLP, 2009
Cited by 23 (3 self)
Broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poor languages. The wide availability of parallel text and accurate parsers in English has opened up the possibility of grammar induction through partial transfer across bitext. We consider generative and discriminative models for dependency grammar induction that use word-level alignments and a source language parser (English) to constrain the space of possible target trees. Unlike previous approaches, our framework does not require full projected parses, allowing partial, approximate transfer through linear expectation constraints on the space of distributions over trees. We consider several types of constraints that range from generic dependency conservation to language-specific annotation rules for auxiliary verb analysis. We evaluate our approach on Bulgarian and Spanish CoNLL shared task data and show that we consistently outperform unsupervised methods and can outperform supervised learning for limited training data.
Posterior vs. Parameter Sparsity in Latent Variable Models Supplementary Material
Cited by 22 (0 self)
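1.1 Derivation of the ℓ1/ℓ∞ dual program
We want to optimize the objective:
\[ \min_{q,\, c_{wt}} \; \mathrm{KL}(q \,\|\, p) + \sigma \sum_{wt} c_{wt} \quad \text{s.t.} \quad E_q[f_{wti}] \le c_{wt}, \quad 0 \le c_{wt} \]
The Lagrangian becomes:
\[ L(q, c, \alpha, \lambda) = \mathrm{KL}(q \,\|\, p) + \sigma \sum_{wt} c_{wt} + \sum_{wti} \lambda_{wti} \left( E_q[f_{wti}] - c_{wt} \right) - \alpha \cdot c \tag{2} \]
where we are maximizing with respect to λ ≥ 0 and α ≥ 0. Taking the derivative with respect to q(z) we have:
\[ \frac{\partial L(q, c, \alpha, \lambda)}{\partial q(z)} = \log q(z) + 1 - \log p(z) + f(z) \cdot \lambda \tag{3} \]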
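The excerpt stops at the stationarity condition. As a hedged sketch of the step that would usually follow (it is not part of the excerpt), assume a multiplier μ is added for the normalization constraint on q, and that the λ term enters with the sign implied by differentiating the Lagrangian (2). The symbols μ and Z_λ are introduced here for illustration only.

```latex
\[
\log q(z) + 1 - \log p(z) + \lambda \cdot f(z) + \mu = 0
\;\Longrightarrow\;
q(z) = \frac{p(z)\, \exp\!\left(-\lambda \cdot f(z)\right)}{Z_\lambda},
\qquad
Z_\lambda = \sum_{z'} p(z')\, \exp\!\left(-\lambda \cdot f(z')\right)
\]
\[
\frac{\partial L}{\partial c_{wt}} = \sigma - \sum_{i} \lambda_{wti} - \alpha_{wt} = 0
\;\Longrightarrow\;
\sum_{i} \lambda_{wti} \le \sigma
\quad \text{(since } \alpha_{wt} \ge 0\text{)}
\]
```

Under these assumptions the primal minimizer is an exponentially tilted version of p, and the dual variables for each (w, t) pair are bounded in ℓ1 norm by σ.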
Sparsity in Dependency Grammar Induction
Cited by 19 (0 self)
A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
Estimating labels from label proportions
Proceedings of the 25th Annual International Conference on Machine Learning, 2008
Cited by 17 (1 self)
Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like e-commerce, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice.
Semi-supervised Semantic Role Labeling Using the Latent Words Language Model
In Proceedings of EMNLP09, 2009
Cited by 13 (2 self)
Semantic Role Labeling (SRL) has proved to be a valuable tool for performing automatic analysis of natural language texts. Currently, however, most systems rely on a large, manually annotated training set, an effort that needs to be repeated whenever different languages or a different set of semantic roles is used in a certain application. A possible solution for this problem is semi-supervised learning, where a small set of training examples is automatically expanded using unlabeled texts. We present the Latent Words Language Model, which is a language model that learns word similarities from unlabeled texts. We use these similarities for different semi-supervised SRL methods as additional features or to automatically expand a small training set. We evaluate the methods on the PropBank dataset and find that for small training sizes our best performing system achieves an error reduction of 33.27% F1-measure compared to a state-of-the-art supervised baseline.
Multi-view dimensionality reduction via canonical correlation analysis
2008
Cited by 12 (2 self)
We analyze the multi-view regression problem where we have two views X = (X^(1), X^(2)) of the input data and a target variable Y of interest. We provide sufficient conditions under which we can reduce the dimensionality of X (via a projection) without losing predictive power of Y. Crucially, this projection can be computed via a Canonical Correlation Analysis only on the unlabeled data. The algorithmic template is as follows: with unlabeled data, perform CCA and construct a certain projection; with the labeled data, do least squares regression in this lower dimensional space. We show how, under certain natural assumptions, the number of labeled samples could be significantly reduced (in comparison to the single view setting); in particular, we show how this dimensionality reduction does not lose predictive power of Y (thus it only introduces little bias but could drastically reduce the variance). We explore two separate assumptions under which this is possible and show how, under either assumption alone, dimensionality reduction could reduce the labeled sample complexity. The two assumptions we consider are a conditional independence assumption and a redundancy assumption. The typical conditional independence assumption is that conditioned on Y the views X^(1) and X^(2) are independent; we relax this assumption so that, conditioned on some hidden state H, the views X^(1) and X^(2) are independent. Under the redundancy assumption, we have that the In recent years, the “multi-view” approach has been receiving increasing attention as a paradigm for semi-supervised
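The two-stage template in the abstract above (CCA on unlabeled data, then least squares on labeled data in the reduced space) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the abstract does not say which projection is constructed, so the code simply projects view 1 onto its top-k canonical directions; the function names, the ridge term `reg`, and the Cholesky-based whitening are illustrative choices, not the authors' procedure.

```python
import numpy as np

def cca_projection(X1, X2, k, reg=1e-6):
    """Estimate a CCA projection for view 1 from unlabeled paired views.

    Returns the view-1 mean, a (d1, k) projection matrix, and the top-k
    canonical correlations. `reg` is a small ridge term (an assumption).
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    A, B = X1 - mu1, X2 - mu2
    n = A.shape[0]
    C11 = A.T @ A / n + reg * np.eye(A.shape[1])
    C22 = B.T @ B / n + reg * np.eye(B.shape[1])
    C12 = A.T @ B / n
    # Whiten each view with Cholesky factors: C11 = L1 L1^T, C22 = L2 L2^T.
    L1 = np.linalg.cholesky(C11)
    L2 = np.linalg.cholesky(C22)
    # M = L1^{-1} C12 L2^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(L1, C12)
    M = np.linalg.solve(L2, M.T).T
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    # Map the left singular vectors back to the original view-1 coordinates.
    W1 = np.linalg.solve(L1.T, U[:, :k])
    return mu1, W1, s[:k]

def fit_least_squares(Z, y):
    """Ordinary least squares with an intercept on the projected features."""
    Zb = np.column_stack([Z, np.ones(len(Z))])
    coef, *_ = np.linalg.lstsq(Zb, y, rcond=None)
    return coef

def predict(Z, coef):
    """Apply the fitted coefficients (last entry is the intercept)."""
    return np.column_stack([Z, np.ones(len(Z))]) @ coef
```

A typical use would call `cca_projection` on the full unlabeled sample, project the few labeled points with `(X1_labeled - mu1) @ W1`, and pass the result to `fit_least_squares`; the labeled data never influence the projection itself.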
Semi-supervised learning of dependency parsers using generalized expectation criteria
In Proc. ACL, 2009
Cited by 12 (1 self)
In this paper, we propose a novel method for semi-supervised learning of non-projective log-linear dependency parsers using directly expressed linguistic prior knowledge (e.g. a noun’s parent is often a verb). Model parameters are estimated using a generalized expectation (GE) objective function that penalizes the mismatch between model predictions and linguistic expectation constraints. In a comparison with two prominent “unsupervised” learning methods that require indirect biasing toward the correct syntactic structure, we show that GE can attain better accuracy with as few as 20 intuitive constraints. We also present positive experimental results on longer sentences in multiple languages.