Results 1-10 of 54
Contrastive estimation: Training loglinear models on unlabeled data
In Proc. of ACL, 2005
Cited by 119 (14 self)
Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and named-entity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for log-linear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem (POS tagging given a tagging dictionary and unlabeled text), contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features.
Alpino: Wide-coverage Computational Analysis of Dutch
2000
Cited by 72 (11 self)
Alpino is a wide-coverage computational analyzer of Dutch which aims at accurate, full parsing of unrestricted text. We describe the head-driven lexicalized grammar and the lexical component, which has been derived from existing resources. The grammar produces dependency structures, thus providing a reasonably abstract and theory-neutral level of linguistic representation. An important aspect of wide-coverage parsing is robustness and disambiguation. The dependency relations encoded in the dependency structures have been used to develop and evaluate both hand-coded and statistical disambiguation methods.
Learning for Semantic Parsing with Statistical Machine Translation
2006
Cited by 46 (7 self)
We present a novel statistical approach to semantic parsing, WASP, for constructing a complete, formal meaning representation of a sentence. A semantic parser is learned given a set of sentences annotated with their correct meaning representations. The main innovation of WASP is its use of state-of-the-art statistical machine translation techniques. A word alignment model is used for lexical acquisition, and the parsing model itself can be seen as a syntax-based translation model. We show that WASP performs favorably in terms of both accuracy and coverage compared to existing learning methods requiring a similar amount of supervision, and shows better robustness to variations in task complexity and word order.
Feature Forest Models for Probabilistic HPSG Parsing
In Computational Linguistics, 2008
Cited by 36 (6 self)
Probabilistic modeling of lexicalized grammars is difficult because these grammars exploit complicated data structures, such as typed feature structures. This prevents us from applying common methods of probabilistic modeling in which a complete structure is divided into substructures under the assumption of statistical independence among substructures. For example, part-of-speech tagging of a sentence is decomposed into tagging of each word, and CFG parsing is split into applications of CFG rules. These methods have relied on the structure of the target problem, namely lattices or trees, and cannot be applied to graph structures including typed feature structures. This article proposes the feature forest model as a solution to the problem of probabilistic modeling of complex data structures including typed feature structures. The feature forest model provides a method for probabilistic modeling without the independence assumption when probabilistic events are represented with feature forests. Feature forests are generic data structures that represent ambiguous trees in a packed forest structure. Feature forest models are maximum entropy models defined over feature forests. A dynamic programming algorithm is proposed for maximum entropy estimation without unpacking feature forests. Thus probabilistic modeling of
Inducing German Semantic Verb Classes from Purely Syntactic Subcategorisation Information
In Proceedings of the 40th Annual Meeting of the ACL, 2002
Cited by 34 (3 self)
The paper describes the application of k-Means, a standard clustering technique, to the task of inducing semantic classes for German verbs. Using probability distributions over verb subcategorisation frames, we obtained an intuitively plausible clustering of 57 verbs into 14 classes.
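As a rough illustration of the setup this abstract describes, the sketch below runs plain k-means over a few frame distributions. It is a minimal sketch under stated assumptions, not the paper's procedure: the four verbs' distributions are invented toy numbers, the three frame types are placeholders, squared Euclidean distance is assumed as the dissimilarity, and the function names `sq_dist` and `kmeans` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Invented toy data: P(frame | verb) over three placeholder frame types
# (intransitive, transitive, ditransitive). These are NOT figures from the paper.
verbs = ["geben", "schlafen", "bringen", "laufen"]
points = [
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.1, 0.3, 0.6],
    [0.7, 0.2, 0.1],
]

assignment = kmeans(points, 2)
for verb, cluster in zip(verbs, assignment):
    print(verb, cluster)
```

Seeding the centroids with the first k points keeps the toy run deterministic; k-means is sensitive to initialization, so a real experiment would need a more careful choice.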
Probabilistic Syntax
2002
Cited by 34 (1 self)
istic methods for syntax, just as for a long time McCarthy and Hayes (1969) discouraged exploration of probabilistic methods in Artificial Intelligence. Among his arguments were that: (i) Probabilistic models wrongly mix in world knowledge (New York occurs more in text than Dayton, Ohio, but for no linguistic reason), (ii) Probabilistic models don't model grammaticality (neither Colorless green ideas sleep furiously nor Furiously sleep ideas green colorless have previously been uttered, and hence must be estimated to have probability zero, Chomsky wrongly assumes, but the former is grammatical while the latter is not), and (iii) Use of probabilities does not meet the goal of describing the mind-internal I-language as opposed to the observed-in-the-world E-language. This chapter is not meant to be a detailed critique of Chomsky's arguments (Abney (1996) provides a survey and a rebuttal, and Pereira (2000) has further useful discussion), but some of these concerns are still importa
Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
2006
Cited by 28 (8 self)
This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias. Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations)
Automatic F-Structure Annotation of Treebank Trees
The Fifth International Conference on Lexical-Functional Grammar, The University of California at Berkeley, 19 July - 20 July 2000, CSLI, 2000
Cited by 27 (6 self)
We describe a method that automatically induces LFG f-structures from treebank tree representations, given a set of f-structure annotation principles that define partial, modular c- to f-structure correspondences in a linguistically informed, principle-based way.
Guiding unsupervised grammar induction using contrastive estimation
In Proc. of IJCAI Workshop on Grammatical Inference Applications, 2005
Cited by 25 (7 self)
We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MATCHLINGUIST task). The selection of an implicit negative evidence class (a “neighborhood”) appropriate to a given task has strong implications, but with a good neighborhood one can target the objective of grammar induction to a specific application.
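For orientation, the criterion summarized in this abstract can be written out for a log-linear model roughly as follows. This is a sketch in assumed notation, not a quotation from the paper: $\theta$ is the weight vector, $f$ the feature function, $\mathcal{Y}$ the set of hidden structures, and $\mathcal{N}(x_i)$ the neighborhood of the observed sentence $x_i$, taken here to contain $x_i$ itself.

```latex
\mathcal{L}_{\mathcal{N}}(\theta)
  = \sum_{i} \log
    \frac{\sum_{y \in \mathcal{Y}} \exp\bigl(\theta \cdot f(x_i, y)\bigr)}
         {\sum_{x' \in \mathcal{N}(x_i)} \sum_{y \in \mathcal{Y}}
            \exp\bigl(\theta \cdot f(x', y)\bigr)}
```

If $\mathcal{N}(x_i)$ is taken to be the whole space of possible sentences, the denominator becomes the global normalizer and the objective reduces to the marginal log-likelihood that EM locally maximizes, which is the sense in which the criterion generalizes it.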
Probabilistic models of nonprojective dependency trees
In Proc. EMNLP-CoNLL, 2007
Cited by 25 (9 self)
A notable gap in research on statistical dependency parsing is a proper conditional probability distribution over nonprojective dependency trees for a given sentence. We exploit the Matrix Tree Theorem (Tutte, 1984) to derive an algorithm that efficiently sums the scores of all nonprojective trees in a sentence, permitting the definition of a conditional log-linear model over trees. While discriminative methods, such as those presented in McDonald et al. (2005b), obtain very high accuracy on standard dependency parsing tasks and can be trained and applied without marginalization, “summing trees” permits some alternative techniques of interest. Using the summing algorithm, we present competitive experimental results on four nonprojective languages, for maximum conditional likelihood estimation, minimum Bayes-risk parsing, and hidden variable training.
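The idea of summing over all nonprojective trees with a determinant can be illustrated with a small sketch. It assumes the variant in which the artificial root (node 0) may have several children, represents edge scores as already-exponentiated weights `w[h][m]`, and uses hypothetical function names; it is not the authors' implementation. A brute-force enumerator is included only to check the determinant on tiny inputs.

```python
from fractions import Fraction
from itertools import product


def determinant(matrix):
    """Determinant by Gaussian elimination over exact fractions."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def sum_over_trees(w):
    """Sum of edge-weight products over all dependency trees rooted at node 0.

    w[h][m] is the weight of the edge from head h to modifier m; column 0 is unused.
    """
    n = len(w) - 1
    lap = [[0] * n for _ in range(n)]
    for m in range(1, n + 1):
        for h in range(n + 1):
            if h == m:
                continue
            lap[m - 1][m - 1] += w[h][m]      # diagonal: total incoming weight of m
            if h >= 1:
                lap[h - 1][m - 1] -= w[h][m]  # off-diagonal: negated edge weight
    return determinant(lap)


def brute_force_sum(w):
    """Same quantity by explicit enumeration; exponential, for checking only."""
    n = len(w) - 1
    total = Fraction(0)
    for heads in product(range(n + 1), repeat=n):  # heads[m-1] is the head of m
        if any(heads[m - 1] == m for m in range(1, n + 1)):
            continue
        is_tree = True
        for m in range(1, n + 1):
            seen, node = set(), m
            while node != 0 and node not in seen:
                seen.add(node)
                node = heads[node - 1]
            if node != 0:  # walked into a cycle instead of reaching the root
                is_tree = False
                break
        if is_tree:
            weight = Fraction(1)
            for m in range(1, n + 1):
                weight *= Fraction(w[heads[m - 1]][m])
            total += weight
    return total


# Arbitrary illustrative weights for a three-word sentence (row = head, column = modifier).
weights = [
    [0, 2, 1, 3],
    [0, 0, 4, 1],
    [0, 5, 0, 2],
    [0, 1, 3, 0],
]
print(sum_over_trees(weights) == brute_force_sum(weights))
```

The determinant needs a cubic number of arithmetic operations in the sentence length, whereas the enumeration grows exponentially. Exact fractions are used here only to make the comparison exact; a practical system would work with floating-point matrices.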