Results 1–10 of 43
Posterior Regularization for Structured Latent Variable Models
Abstract

Cited by 67 (7 self)
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of the structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large-scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.
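The core computational step the abstract alludes to is a KL-projection of the model posterior onto the set of distributions whose expectations satisfy the constraints. Below is a minimal illustrative sketch for a single discrete latent variable and one expectation constraint E_q[f] ≤ b; the function name and the bisection-based dual solver are our own choices, not the paper's implementation.

```python
import math

def project_posterior(p, f, b, iters=100):
    """KL-project a discrete posterior p onto {q : E_q[f] <= b}.

    The projection has the exponentiated form q(z) proportional to
    p(z) * exp(-lam * f(z)) with a dual variable lam >= 0; here we
    find lam by bisection on the constrained expectation E_q[f].
    """
    def q_of(lam):
        w = [pi * math.exp(-lam * fi) for pi, fi in zip(p, f)]
        z = sum(w)
        return [wi / z for wi in w]

    def expectation(q):
        return sum(qi * fi for qi, fi in zip(q, f))

    if expectation(p) <= b:           # constraint already holds: q = p
        return list(p)
    lo, hi = 0.0, 1.0
    while expectation(q_of(hi)) > b:  # grow the bracket until feasible
        hi *= 2.0
    for _ in range(iters):            # bisect on the dual variable
        mid = 0.5 * (lo + hi)
        if expectation(q_of(mid)) > b:
            lo = mid
        else:
            hi = mid
    return q_of(hi)

# Base posterior puts most mass on the state with f(z)=1;
# constrain the expectation of f under q to be at most 0.5.
q = project_posterior(p=[0.1, 0.9], f=[0.0, 1.0], b=0.5)
```

In the full framework this projection replaces the E-step posterior, so the M-step stays exactly as in unconstrained EM; the sketch only shows the one-constraint scalar case.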
Generalized expectation criteria for semi-supervised learning of conditional random fields
 In Proc. ACL, pages 870–878, 2008
Abstract

Cited by 64 (8 self)
This paper presents a semisupervised training method for linearchain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using generalized expectation criteria to express a preference for parameter settings in which the model’s distribution on unlabeled data matches a target distribution. We induce target conditional probability distributions of labels given features from both annotated feature occurrences in context and adhoc feature majority label assignment. The use of generalized expectation criteria allows for a dramatic reduction in annotation time by shifting from traditional instancelabeling to featurelabeling, and the methods presented outperform traditional CRF training and other semisupervised methods when limited human effort is available. 1
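A generalized expectation criterion of the kind described scores parameters by how closely the model's average predicted label distribution, over unlabeled tokens where a feature fires, matches an annotator-supplied target. A minimal sketch of that score as a KL divergence; the feature name and numbers are purely illustrative:

```python
import math

def ge_penalty(target, predicted_dists):
    """GE-style penalty: KL(target || model average).

    target: desired label distribution for tokens carrying some feature.
    predicted_dists: the model's label distributions at unlabeled tokens
    where that feature fires; we compare their average to the target.
    """
    n = len(predicted_dists)
    avg = [sum(d[k] for d in predicted_dists) / n for k in range(len(target))]
    return sum(t * math.log(t / a) for t, a in zip(target, avg) if t > 0)

# Say an annotator expects tokens with a hypothetical "ends-in-ly"
# feature to be ADV 90% of the time (labels: [ADV, OTHER]), while the
# model currently predicts 60/40 on the unlabeled tokens.
pen = ge_penalty([0.9, 0.1], [[0.6, 0.4], [0.6, 0.4]])
```

Training then minimizes this penalty (plus any supervised likelihood term) over the CRF parameters; the penalty is zero exactly when the model's average matches the target.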
Expectation maximization and posterior constraints
 In Advances in NIPS, 2007
Abstract

Cited by 50 (11 self)
The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data, and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models.
Dependency grammar induction via bitext projection constraints
 In ACL-IJCNLP, 2009
Abstract

Cited by 22 (3 self)
Broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poor languages. The wide availability of parallel text and accurate parsers in English has opened up the possibility of grammar induction through partial transfer across bitext. We consider generative and discriminative models for dependency grammar induction that use word-level alignments and a source language parser (English) to constrain the space of possible target trees. Unlike previous approaches, our framework does not require full projected parses, allowing partial, approximate transfer through linear expectation constraints on the space of distributions over trees. We consider several types of constraints that range from generic dependency conservation to language-specific annotation rules for auxiliary verb analysis. We evaluate our approach on Bulgarian and Spanish CoNLL shared task data and show that we consistently outperform unsupervised methods and can outperform supervised learning for limited training data.
Posterior vs. Parameter Sparsity in Latent Variable Models: Supplementary Material
Abstract

Cited by 20 (0 self)
1.1 Derivation of the ℓ1/ℓ∞ dual program

We want to optimize the objective:

    min_{q, c_{wt}}  KL(q ‖ p) + σ ∑_{wt} c_{wt}
    s.t.  E_q[f_{wti}] ≤ c_{wt},   0 ≤ c_{wt}                            (1)

The Lagrangian becomes:

    L(q, c, α, λ) = KL(q ‖ p) + σ ∑_{wt} c_{wt}
                    + ∑_{wti} λ_{wti} (E_q[f_{wti}] − c_{wt}) − α · c    (2)

where we are maximizing with respect to λ ≥ 0 and α ≥ 0. Taking the derivative with respect to q(z) we have:

    ∂L(q, c, α, λ)/∂q(z) = log q(z) + 1 − log p(z) + f(z) · λ            (3)
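The snippet breaks off here. Setting the derivative with respect to q(z) to zero and normalizing yields the standard exponentiated form of the projected posterior; this completion is the usual next step in derivations of this family and is not part of the quoted text:

    log q(z) = log p(z) − 1 − f(z) · λ
    ⟹  q(z) = p(z) exp(−f(z) · λ) / Z(λ),
        Z(λ) = ∑_z p(z) exp(−f(z) · λ)

Substituting q back into the Lagrangian then gives the dual program over λ and the c-variables, which is where the ℓ1/ℓ∞ structure of the penalty appears.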
Sparsity in Dependency Grammar Induction
Abstract

Cited by 18 (0 self)
A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with an average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
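The ℓ1/ℓ∞ penalty from the supplementary entry above has a simple shape: for each parent-child (word, tag) pair, take the maximum posterior expectation over its occurrences, then sum those maxima, so a small value means each word concentrates on few dependency types. A toy sketch with made-up expectations (the function and the example values are illustrative, not the paper's data):

```python
def l1_linf_penalty(expectations, sigma=1.0):
    """ℓ1/ℓ∞ posterior penalty: for each (word, tag) pair, take the
    max over occurrences i of the posterior expectation E_q[f_wti],
    then sum the maxima. Sparse tag usage keeps the sum small."""
    return sigma * sum(max(occ) for occ in expectations.values())

# Hypothetical expectations for one word across three occurrences:
# "run" is confidently a VERB, only marginally a NOUN.
exps = {("run", "VERB"): [0.9, 0.8, 0.95],
        ("run", "NOUN"): [0.1, 0.2, 0.05]}
pen = l1_linf_penalty(exps)
```

Because only the per-pair maximum matters, using a tag once costs nearly as much as using it everywhere, which is exactly what pushes the posterior toward few distinct dependency types.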
Multi-view dimensionality reduction via canonical correlation analysis
, 2008
Abstract

Cited by 12 (2 self)
We analyze the multi-view regression problem where we have two views X = (X^(1), X^(2)) of the input data and a target variable Y of interest. We provide sufficient conditions under which we can reduce the dimensionality of X (via a projection) without losing predictive power for Y. Crucially, this projection can be computed via a Canonical Correlation Analysis on the unlabeled data alone. The algorithmic template is as follows: with unlabeled data, perform CCA and construct a certain projection; with the labeled data, do least squares regression in this lower dimensional space. We show how, under certain natural assumptions, the number of labeled samples could be significantly reduced in comparison to the single view setting; in particular, we show how this dimensionality reduction does not lose predictive power for Y (thus it only introduces little bias but could drastically reduce the variance). We explore two separate assumptions under which this is possible and show how, under either assumption alone, dimensionality reduction could reduce the labeled sample complexity. The two assumptions we consider are a conditional independence assumption and a redundancy assumption. The typical conditional independence assumption is that, conditioned on Y, the views X^(1) and X^(2) are independent; we relax this so that the views are independent conditioned on some hidden state H. Under the redundancy assumption, we have that the ...
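The algorithmic template in this abstract (CCA on unlabeled paired views, then least squares in the reduced space) can be sketched directly. A minimal NumPy version under the hidden-state assumption; the synthetic data, function names, and regularization constant are our own illustrative choices:

```python
import numpy as np

def cca_projection(X1, X2, k, reg=1e-6):
    """Top-k CCA directions for view 1 from unlabeled paired views.
    Returns W so that Z = (X1 - mean) @ W is the reduced representation.
    """
    X1 = X1 - X1.mean(0)
    X2 = X2 - X2.mean(0)
    n = X1.shape[0]
    Cxx = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    Cyy = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    Cxy = X1.T @ X2 / n

    def inv_sqrt(C):  # symmetric inverse square root via eigendecomposition
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Wx = inv_sqrt(Cxx)
    U, _, _ = np.linalg.svd(Wx @ Cxy @ inv_sqrt(Cyy))
    return Wx @ U[:, :k]

rng = np.random.default_rng(0)
h = rng.normal(size=(500, 1))                   # shared hidden state H
X1 = h @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))
X2 = h @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))
y = h[:, 0]                                     # target depends only on H

W = cca_projection(X1, X2, k=1)                 # uses only "unlabeled" views
Z = (X1 - X1.mean(0)) @ W                       # 1-D representation
beta = np.linalg.lstsq(Z, y, rcond=None)[0]     # regression on labeled data
```

Because both views are driven by the same hidden state, the top canonical direction of view 1 recovers (a scaling of) that state, so the one-dimensional regression retains the predictive power of the full five-dimensional view.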
Semi-Supervised Random Forests
Abstract

Cited by 12 (6 self)
Random Forests (RFs) have become commonplace in many computer vision applications. Their popularity is mainly driven by their high computational efficiency during both training and evaluation while still being able to achieve state-of-the-art accuracy. This work extends the usage of Random Forests to Semi-Supervised Learning (SSL) problems. We show that traditional decision trees are optimizing multi-class margin maximizing loss functions. From this intuition, we develop a novel multi-class margin definition for the unlabeled data, and an iterative deterministic annealing-style training algorithm maximizing both the multi-class margin of labeled and unlabeled samples. In particular, this allows us to use the predicted labels of the unlabeled data as additional optimization variables. Furthermore, we propose a control mechanism based on the out-of-bag error, which prevents the algorithm from degradation if the unlabeled data is not useful for the task. Our experiments demonstrate state-of-the-art semi-supervised learning performance in typical machine learning problems and constant improvement using unlabeled data for the Caltech-101 object categorization task.
Semi-supervised learning of dependency parsers using generalized expectation criteria
 In Proc. ACL, 2009
Abstract

Cited by 11 (1 self)
In this paper, we propose a novel method for semi-supervised learning of non-projective log-linear dependency parsers using directly expressed linguistic prior knowledge (e.g. a noun’s parent is often a verb). Model parameters are estimated using a generalized expectation (GE) objective function that penalizes the mismatch between model predictions and linguistic expectation constraints. In a comparison with two prominent “unsupervised” learning methods that require indirect biasing toward the correct syntactic structure, we show that GE can attain better accuracy with as few as 20 intuitive constraints. We also present positive experimental results on longer sentences in multiple languages.
Semi-supervised convex training for dependency parsing
 In Proceedings of ACL-08: HLT, 2008
Abstract

Cited by 10 (1 self)
We present a novel semi-supervised training algorithm for learning dependency parsers. By combining a supervised large margin loss with an unsupervised least squares loss, a discriminative, convex, semi-supervised learning algorithm can be obtained that is applicable to large-scale problems. To demonstrate the benefits of this approach, we apply the technique to learning dependency parsers from combined labeled and unlabeled corpora. Using a stochastic gradient descent algorithm, a parsing model can be efficiently learned from semi-supervised data that significantly outperforms corresponding supervised methods.
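The combination described (a supervised large margin loss plus an unsupervised least squares loss, optimized by stochastic gradient descent) can be sketched on a plain linear classifier rather than a structured parser. Everything below is an illustrative toy, not the paper's parsing model; the pseudo-target rule for unlabeled points is our own simplification:

```python
def semisup_sgd(labeled, unlabeled, dim, lam=0.1, lr=0.01, epochs=200):
    """Toy combined objective: hinge loss on labeled (x, y) pairs with
    y in {-1, +1}, plus a least squares term pulling the score on each
    unlabeled x toward the nearest label score in {-1, +1}.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in labeled:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if y * s < 1:                       # hinge subgradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        for x in unlabeled:
            s = sum(wi * xi for wi, xi in zip(w, x))
            t = 1.0 if s >= 0 else -1.0         # nearest target score
            # least squares pull: gradient of lam * (s - t)^2 / 2
            w = [wi - lr * lam * (s - t) * xi for wi, xi in zip(w, x)]
    return w

labeled = [([1.0, 0.2], 1), ([-1.0, -0.1], -1)]
unlabeled = [[0.9, 0.3], [-0.8, -0.2], [1.1, 0.1]]
w = semisup_sgd(labeled, unlabeled, dim=2)
```

The two loss terms share one weight vector, so each SGD pass interleaves margin updates from the labeled data with least squares updates from the unlabeled data, mirroring the structure (though not the scale or the parsing-specific losses) of the approach in the abstract.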