Structured learning with approximate inference
- Advances in Neural Information Processing Systems
Cited by 31 (1 self)
In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that without understanding combinations of inference and learning, such as these, that are appropriately compatible, learning performance under approximate inference cannot be guaranteed.
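As a rough illustration of where inference sits inside the learning loop discussed in this abstract, the sketch below implements a plain structured perceptron for a toy tagging task. It is not the paper's method: the feature map, the function names, and the exhaustive `exact_inference` routine are all assumptions made for this example. The `infer` argument marks the place where an approximate routine (for example, LP-relaxed inference) would be substituted, which is exactly the substitution the abstract warns is not automatically safe.

```python
from itertools import product


def features(x, y):
    """Hypothetical feature map: emission (word, tag) and transition (tag, tag) counts."""
    f = {}
    for i, (word, tag) in enumerate(zip(x, y)):
        f[("emit", word, tag)] = f.get(("emit", word, tag), 0) + 1
        if i > 0:
            key = ("trans", y[i - 1], tag)
            f[key] = f.get(key, 0) + 1
    return f


def score(w, x, y):
    """Linear score of labeling y for input x under weights w."""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())


def exact_inference(w, x, tags):
    """Exhaustive argmax over all labelings; only feasible for tiny inputs."""
    return max(product(tags, repeat=len(x)), key=lambda y: score(w, x, y))


def perceptron_epoch(w, data, tags, infer=exact_inference):
    """One pass of structured perceptron updates; returns the number of mistakes."""
    mistakes = 0
    for x, y in data:
        y_hat = infer(w, x, tags)
        if tuple(y_hat) != tuple(y):
            mistakes += 1
            # Move weights toward the gold labeling and away from the prediction.
            for k, v in features(x, y).items():
                w[k] = w.get(k, 0.0) + v
            for k, v in features(x, y_hat).items():
                w[k] = w.get(k, 0.0) - v
    return mistakes
```

Passing a different callable as `infer` changes only the prediction step; the update rule itself is unchanged.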
Piecewise pseudolikelihood for efficient CRF training
- In International Conference on Machine Learning (ICML)
, 2007
Cited by 13 (2 self)
Discriminative training of graphical models can be expensive if the variables have large cardinality, even if the graphical structure is tractable. In such cases, pseudolikelihood is an attractive alternative, because its running time is linear in the variable cardinality, but on some data its accuracy can be poor. Piecewise training (Sutton & McCallum, 2005) can have better accuracy but does not scale as well in the variable cardinality. In this paper, we introduce piecewise pseudolikelihood, which retains the computational efficiency of pseudolikelihood but can have much better accuracy. On several benchmark NLP data sets, piecewise pseudolikelihood has better accuracy than standard pseudolikelihood, and in many cases nearly equivalent to maximum likelihood, with five to ten times less training time than batch CRF training.
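To make the "linear in the variable cardinality" claim concrete, here is a minimal sketch of standard (Besag-style) pseudolikelihood for a linear-chain model. It does not implement the paper's piecewise variant. The function name and the score-table layout are assumptions for illustration.

```python
import math


def log_pseudolikelihood(labels, node_scores, edge_scores):
    """Log pseudolikelihood of a label sequence under a linear-chain model.

    node_scores[i][a] : score of label a at position i
    edge_scores[a][b] : score of label a followed by label b
    """
    n, k = len(labels), len(edge_scores)
    total = 0.0
    for i in range(n):
        # Score each candidate label at position i with both neighbours
        # clamped to their observed values: O(k) work per position.
        local = []
        for a in range(k):
            s = node_scores[i][a]
            if i > 0:
                s += edge_scores[labels[i - 1]][a]
            if i < n - 1:
                s += edge_scores[a][labels[i + 1]]
            local.append(s)
        # Numerically stable log-sum-exp for the local normaliser.
        m = max(local)
        log_z = m + math.log(sum(math.exp(s - m) for s in local))
        total += local[labels[i]] - log_z
    return total
```

Each position needs only a sum over k candidate labels, whereas the exact likelihood of the same chain needs forward-backward passes that cost on the order of k squared per position.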
Polyhedral Outer Approximations with Application to Natural Language Parsing
Cited by 11 (1 self)
Recent approaches to learning structured predictors often require approximate inference for tractability; yet its effects on the learned model are unclear. Meanwhile, most learning algorithms act as if computational cost was constant within the model class. This paper sheds some light on the first issue by establishing risk bounds for max-margin learning with LP-relaxed inference and addresses the second issue by proposing a new paradigm that attempts to penalize “time-consuming” hypotheses. Our analysis relies on a geometric characterization of the outer polyhedra associated with the LP relaxation. We then apply these techniques to the problem of dependency parsing, for which a concise LP formulation is provided that handles non-local output features. A significant improvement is shown over arc-factored models.
LTAG dependency parsing with bidirectional incremental construction
- In EMNLP-2008
, 2008
Cited by 10 (0 self)
In this paper, we first introduce a new architecture for parsing, bidirectional incremental parsing. We propose a novel algorithm for incremental construction, which can be applied to many structure learning problems in NLP. We apply this algorithm to LTAG dependency parsing, and achieve a significant improvement in accuracy over the previous best result on the same data set.
A discriminative model for tree-to-tree translation
- In Proceedings of the EMNLP
, 2006
Cited by 9 (0 self)
This paper proposes a statistical, tree-to-tree model for producing translations. Two main contributions are as follows: (1) a method for the extraction of syntactic structures with alignment information from a parallel corpus of translations, and (2) use of a discriminative, feature-based model for prediction of these target-language syntactic structures, which we call aligned extended projections, or AEPs. An evaluation of the method on translation from German to English shows similar performance to the phrase-based model of Koehn et al. (2003).
Learning and Inference in Weighted Logic with Application to Natural Language Processing
, 2008
Discriminative Learning and Spanning Tree Algorithms for Dependency Parsing
, 2006
Cited by 7 (0 self)
In this thesis we develop a discriminative learning method for dependency parsing using online large-margin training combined with spanning tree inference algorithms. We will show that this method provides state-of-the-art accuracy, is extensible through the feature set and can be implemented efficiently. Furthermore, we display the language-independent nature of the method by evaluating it on over a dozen diverse languages, and show its practical applicability through integration into a sentence compression system.
We start by presenting an online large-margin learning framework that is a generalization of the work of Crammer and Singer [34, 37] to structured outputs, such as sequences and parse trees. This will lead to the heart of this thesis: discriminative dependency parsing. Here we will formulate dependency parsing in a spanning tree framework, yielding efficient parsing algorithms for both projective and non-projective tree structures. We will then extend the parsing algorithm to incorporate features over larger substructures without an increase in computational complexity for the projective case. Unfortunately, the non-projective problem then becomes NP-hard, so we provide structurally motivated approximate algorithms. Having defined a set of parsing algorithms, we will also define a rich feature set and train various parsers using the online large-margin learning framework. We then compare our trained dependency parsers to other state-of-the-art parsers on 14 diverse languages: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish.
Having built an efficient and accurate discriminative dependency parser, this thesis will then turn to improving and applying the parser. First we will show how additional resources can provide useful features to increase parsing accuracy and to adapt parsers to new domains. We will also argue that the robustness of discriminative inference-based learning algorithms lends itself well to dependency parsing when feature representations or structural constraints do not allow for tractable parsing algorithms. Finally, we integrate our parsing models into a state-of-the-art sentence compression system to show its applicability to a real-world problem.
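The spanning tree formulation mentioned in this abstract can be pictured with a small brute-force sketch: each candidate arc from a head to a modifier has a score, and the parse is the highest-scoring tree rooted at an artificial root node. The code below enumerates all head assignments, so it is only an illustration of the objective for very short sentences, not one of the efficient algorithms the thesis describes. The function names and the score-matrix layout are assumptions.

```python
from itertools import product


def is_tree(heads):
    """heads[m - 1] is the head of word m (words are 1..n); 0 is the artificial root."""
    n = len(heads)
    for m in range(1, n + 1):
        seen = set()
        node = m
        while node != 0:
            if node in seen:
                return False  # cycle: this word never reaches the root
            seen.add(node)
            node = heads[node - 1]
    return True


def best_tree(scores):
    """Return (heads, total) for the highest-scoring dependency tree.

    scores[h][m] is the score of the arc h -> m, with h in 0..n and m in 1..n
    (column 0 is unused because nothing attaches to the root as a modifier).
    """
    n = len(scores) - 1
    best_total, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(scores[h][m] for m, h in enumerate(heads, start=1))
        if total > best_total:
            best_total, best_heads = total, heads
    return best_heads, best_total
```

Picking the best head for each word independently can produce a cycle, which is why a tree constraint (and hence a spanning tree algorithm) is needed.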
Piecewise Training for Structured Prediction
- MACHINE LEARNING
Cited by 7 (0 self)
A drawback of structured prediction methods is that parameter estimation requires repeated inference, which is intractable for general structures. In this paper, we present an approximate training algorithm called piecewise training that divides the factors into tractable subgraphs, which we call pieces, that are trained independently. Piecewise training can be interpreted as approximating the exact likelihood using belief propagation, and different ways of making this interpretation yield different insights into the method. We also present an extension to piecewise training, called piecewise pseudolikelihood (PWPL), designed for when variables have large cardinality. On several real-world NLP data sets, piecewise training outperforms Besag’s pseudolikelihood and sometimes performs comparably to exact maximum likelihood. In addition, PWPL performs similarly to piecewise training and better than standard pseudolikelihood, but is five to ten times more computationally efficient than batch maximum likelihood training.
SampleRank: Learning preference from atomic gradients
- In NIPS WS on Advances in Ranking
, 2009
Cited by 5 (3 self)
Large templated factor graphs with complex structure that changes during inference have been shown to provide state-of-the-art experimental results on tasks such as identity uncertainty and information integration. However, learning parameters in these models is difficult because computing the gradients requires expensive inference routines. In this paper we propose an online algorithm that instead learns preferences over hypotheses from the gradients between the atomic steps of inference. Although there are a combinatorial number of ranking constraints over the entire hypothesis space, a connection to the frameworks of sampled convex programs reveals a polynomial bound on the number of rankings that need to be satisfied in practice. We further apply ideas of passive-aggressive algorithms to our update rules, enabling us to extend recent work in confidence-weighted classification to structured prediction problems. We compare our algorithm to structured perceptron, contrastive divergence, and persistent contrastive divergence, demonstrating substantial error reductions on two real-world problems (20% over contrastive divergence).
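The idea of learning from pairs of neighbouring configurations can be sketched as a perceptron-style ranking update. This is only a loose illustration in the spirit of the abstract, not the authors' update rule (which also draws on passive-aggressive and confidence-weighted ideas). The function name, the sparse feature dictionaries, and the `obj_a`/`obj_b` objective values are assumptions.

```python
def rank_update(weights, feats_a, feats_b, obj_a, obj_b, lr=1.0):
    """Update weights in place if the model misranks two neighbouring configurations.

    feats_a, feats_b : sparse feature dicts of the two configurations
    obj_a, obj_b     : their scores under a ground-truth objective (higher is better)
    Returns True if an update was made.
    """
    if obj_a == obj_b:
        return False  # the objective expresses no preference for this pair
    good, bad = (feats_a, feats_b) if obj_a > obj_b else (feats_b, feats_a)
    # Feature difference between the preferred and the dispreferred configuration.
    diff = {k: good.get(k, 0.0) - bad.get(k, 0.0) for k in set(good) | set(bad)}
    margin = sum(weights.get(k, 0.0) * v for k, v in diff.items())
    if margin > 0:
        return False  # the model already agrees with the objective
    for k, v in diff.items():
        weights[k] = weights.get(k, 0.0) + lr * v
    return True
```

Because only the feature difference between two adjacent configurations is needed, no full inference pass is required before each update.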
Towards Understanding Situated Natural Language
Cited by 3 (1 self)
We present a general framework and learning algorithm for the task of concept labeling: each word in a given sentence has to be tagged with the unique physical entity (e.g. person, object or location) or abstract concept it refers to. Our method allows both world knowledge and linguistic information to be used during learning and prediction. We show experimentally that we can learn to use world knowledge to resolve ambiguities in language, such as word senses or reference resolution, without the use of handcrafted rules or features.

