CoNLL-X shared task on multilingual dependency parsing
- In Proc. of CoNLL
, 2006
Cited by 161 (2 self)
Each year the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their systems on exactly the same data sets, in order to better compare systems. The tenth CoNLL (CoNLL-X) saw a shared task on Multilingual Dependency Parsing. In this paper, we describe how treebanks for 13 languages were converted into the same dependency format and how parsing performance was measured. We also give an overview of the parsing approaches that participants took and the results that they achieved. Finally, we try to draw general conclusions about multilingual parsing: What makes a particular language, treebank or annotation scheme easier or harder to parse, and which phenomena are challenging for any dependency parser?
Acknowledgement: Many thanks to Amit Dubey and Yuval Krymolowski, the other two organizers of the shared task, for discussions, converting treebanks, writing software and helping with the papers.
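The abstract above mentions a common dependency format and a way of measuring parsing performance without giving details. As a rough illustration only, the sketch below assumes the widely used ten-column, tab-separated CoNLL-X layout (HEAD in column 7, DEPREL in column 8) and computes unlabeled and labeled attachment scores. The function names and the toy sentence are hypothetical, and the sketch counts every token, so it is not a substitute for any official scorer.

```python
from typing import List, Tuple

# One token reduced to the fields needed for scoring: (form, head index, relation).
Token = Tuple[str, int, str]


def read_conll(text: str) -> List[List[Token]]:
    """Read tab-separated, ten-column text with a blank line between sentences.

    Assumed column layout (0-based): 1 = FORM, 6 = HEAD, 7 = DEPREL.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append((cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold: List[List[Token]],
                      pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (UAS, LAS): the share of tokens with the correct head,
    and with the correct head and relation label."""
    total = correct_head = correct_both = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (_, g_head, g_rel), (_, p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_head += 1
                if g_rel == p_rel:
                    correct_both += 1
    if total == 0:
        return 0.0, 0.0
    return correct_head / total, correct_both / total


# Hypothetical two-token sentence; the prediction mislabels one relation.
GOLD = ("1\tJohn\t_\tN\tN\t_\t2\tSBJ\t_\t_\n"
        "2\tsleeps\t_\tV\tV\t_\t0\tROOT\t_\t_\n")
PRED = ("1\tJohn\t_\tN\tN\t_\t2\tOBJ\t_\t_\n"
        "2\tsleeps\t_\tV\tV\t_\t0\tROOT\t_\t_\n")

uas, las = attachment_scores(read_conll(GOLD), read_conll(PRED))
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In the toy example both heads are correct but one label is wrong, so the labeled score is lower than the unlabeled one.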
Relational-Realizational Parsing
- In Proc. of COLING
, 2008
Cited by 7 (3 self)
State-of-the-art statistical parsing models applied to free word-order languages tend to underperform compared to, e.g., parsing English. Constituency-based models often fail to capture generalizations that cannot be stated in structural terms, and dependency-based models employ a ‘single-head’ assumption that often breaks in the face of multiple exponence. In this paper we suggest that the position of a constituent is a form manifestation of its grammatical function, one among various possible means of realization. We develop the Relational-Realizational approach to parsing in which we untangle the projection of grammatical functions and their means of realization to allow for phrase-structure variability and morphological-syntactic interaction. We empirically demonstrate the application of our approach to parsing Modern Hebrew, obtaining 7% error reduction from previously reported results.
EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)
- In Proc. ACL
, 2008
Cited by 7 (2 self)
We address the task of unsupervised POS tagging. We demonstrate that good results can be obtained using the robust EM-HMM learner when provided with good initial conditions, even with incomplete dictionaries. We present a family of algorithms to compute effective initial estimations p(t|w). We test the method on the task of full morphological disambiguation in Hebrew, achieving an error reduction of 25% over a strong uniform distribution baseline. We also test the same method on the standard WSJ unsupervised POS tagging task and obtain results competitive with recent state-of-the-art methods, while using simple and efficient learning methods.
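Several abstracts in this listing report results as an "error reduction". Assuming the usual reading, a relative reduction in error rate with respect to a baseline, the arithmetic can be sketched as follows. The function name and the accuracy values are hypothetical and are not taken from any of the papers.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Relative reduction in error rate when moving from a baseline to a new system.

    Both arguments are accuracies in [0, 1]; the error rate is 1 - accuracy.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors, so the reduction is undefined")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers: going from 80% to 85% accuracy removes a quarter of the errors.
reduction = relative_error_reduction(0.80, 0.85)
print(f"relative error reduction: {reduction:.0%}")
```

Under this reading, a 25% error reduction does not mean accuracy rose by 25 points; it means one in four of the baseline's errors was eliminated.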
Can You Tag the Modal? You Should
- In Proceedings of COLING-ACL-07
, 2007
Cited by 6 (5 self)
Computational linguistics methods are typically first developed and tested in English. When applied to other languages, assumptions from English data are often applied to the target language. One of the most common such assumptions is that a “standard” part-of-speech (POS) tagset can be used across languages with only slight variations. We discuss in this paper a specific issue related to the definition of a POS tagset for Modern Hebrew, as an example to clarify the method through which such variations can be defined. It is widely assumed
A Single Generative Model for Joint Morphological Segmentation and Syntactic Parsing
Cited by 6 (1 self)
Morphological processes in Semitic languages deliver space-delimited words which introduce multiple, distinct, syntactic units into the structure of the input sentence. These words are in turn highly ambiguous, breaking the assumption underlying most parsers that the yield of a tree for a given sentence is known in advance. Here we propose a single joint model for performing both morphological segmentation and syntactic disambiguation which bypasses the associated circularity. Using a treebank grammar, a data-driven lexicon, and a linguistically motivated unknown-tokens handling technique, our model outperforms previous pipelined, integrated or factorized systems for Hebrew morphological and syntactic processing, yielding an error reduction of 12% over the best published results so far.
Three-Dimensional Parametrization for Parsing Morphologically Rich
Cited by 4 (4 self)
Current parameters of accurate unlexicalized parsers based on Probabilistic Context-Free Grammars (PCFGs) form a two-dimensional grid in which rewrite events are conditioned on both horizontal (head-outward) and vertical (parental) histories. In Semitic languages, where arguments may move around rather freely and phrase-structures are often shallow, there are additional morphological factors that govern the generation process. Here we propose that agreement features percolated up the parse-tree form a third dimension of parametrization that is orthogonal to the previous two. This dimension differs from mere “state-splits” as it applies to a whole set of categories rather than to individual ones and encodes linguistically motivated co-occurrences between them. This paper presents extensive experiments with extensions of unlexicalized PCFGs for parsing Modern Hebrew in which tuning the parameters in three dimensions gradually leads to improved performance. Our best result introduces a new, stronger, lower bound on the performance of treebank grammars for parsing Modern Hebrew, and is on a par with current results for parsing Modern Standard Arabic obtained by a fully lexicalized parser trained on a much larger treebank.
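To make the "vertical (parental) history" mentioned in the abstract above concrete, here is a minimal sketch of parent annotation, the simplest form of vertical conditioning for treebank grammars. It is an illustration only and not the paper's model: the tuple-based tree encoding, the `^` separator, the `TOP` root symbol, and the function name are all assumptions, and the agreement-feature dimension the abstract proposes is not shown.

```python
from typing import Tuple, Union

# A leaf is a plain string; an internal node is a tuple (label, child, child, ...).
Tree = Union[str, Tuple]


def parent_annotate(tree: Tree, parent: str = "TOP") -> Tree:
    """Append the parent's original label to every nonterminal label.

    Leaves (words) are returned unchanged.
    """
    if isinstance(tree, str):
        return tree
    label, *children = tree
    annotated_children = tuple(parent_annotate(child, label) for child in children)
    return (f"{label}^{parent}",) + annotated_children


# Hypothetical toy tree.
tree = ("S", ("NP", ("NN", "dogs")), ("VP", ("VB", "bark")))
print(parent_annotate(tree))
```

After annotation, an NP under S and an NP under VP become distinct symbols, so a grammar read off the annotated trees conditions each rewrite on one level of parental context.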
Quality control of treebanks: documenting, converting, patching
- IN LREC 2006 WORKSHOP ON
, 2006
Cited by 2 (1 self)
We report on our experiences with using many different syntactically annotated corpora (treebanks). We list various types of format and annotation errors we have noticed and propose common-sense as well as novel ways to prevent, detect and handle them. We show how the quality of a treebank’s annotation and its documentation are related and how the concepts of patching and versioning that come from the software community can be applied to treebanks in order to improve quality.
Towards Unifying Perception and Cognition: The Ubiquity of Trees. Prepublication
, 2005
Cited by 1 (0 self)
Is there a single mechanism that underlies all perceptual and cognitive processing? This paper aims to solve a small part of Newell's challenge (A. Newell 1990, Unified Theories of Cognition, Harvard University Press) and proposes a model that unifies three different modalities: language, music and problem-solving. In doing so, we will focus on tree structures. Trees are ubiquitous in modeling high-level perception and cognition and have been used to represent grouping structures in linguistic, musical and visual perception and deductive structures in reasoning, learning and problem solving. We will show that an instantiation of the Data-Oriented Parsing (DOP) framework can accurately predict the correct tree structure for linguistic utterances, musical pieces and physics problems. The key idea of the DOP framework is that new input is analyzed by combining subtrees from a representative corpus of previous trees. While the labeling of the trees and the details of the combination operation may differ across the modalities, we argue that there is one model for predicting the tree that humans come up with. We report on experiments with manually annotated corpora for the three modalities, showing that the best performing model is the one which takes into account subtrees of arbitrary size and which selects the most probable tree from among the shortest derivations of an input.
Cross-Framework Evaluation for Statistical Parsing
Cited by 1 (0 self)
A serious bottleneck of comparative parser evaluation is the fact that different parsers subscribe to different formal frameworks and theoretical assumptions. Converting outputs from one framework to another is less than optimal as it easily introduces noise into the process. Here we present a principled protocol for evaluating parsing results across frameworks based on function trees, tree generalization and edit distance metrics. This extends a previously proposed framework for cross-theory evaluation and allows us to compare a wider class of parsers. We demonstrate the usefulness and language independence of our procedure by evaluating constituency and dependency parsers on English and Swedish.
Tagging a Hebrew Corpus: The Case of Participles
Cited by 1 (1 self)
We report on an effort to build a corpus of Modern Hebrew tagged with parts of speech and morphology. We designed a tagset specific to Hebrew while focusing on four aspects: the tagset should be consistent with common linguistic knowledge; there should be maximal agreement among taggers as to the tags assigned to maintain consistency; the tagset should be useful for machine taggers and learning algorithms; and the tagset should be effective for applications relying on the tags as input features. In this paper, we illustrate these issues by explaining our decision to introduce a tag for participles in Hebrew. We explain how this tag is defined, and how it helped us improve the manual tagging accuracy to a high level, while improving automatic tagging and helping in the task of syntactic chunking.

