Results 1 - 6 of 6
Robust Accurate Statistical Annotation of General Text
, 2002
Cited by 146 (11 self)
We describe a robust accurate domain-independent approach to statistical parsing incorporated into the new release of the ANLT toolkit, and publicly available as a research tool. The system has been used to parse many well known corpora in order to produce data for lexical acquisition efforts; it has also been used as a component in an open-domain question answering project. The performance of the system is competitive with that of statistical parsers using highly lexicalised parse selection models. However, we plan to extend the system to improve parse coverage, depth and accuracy.
Multiword Expressions: A Pain in the Neck for NLP
- In Proc. of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002)
, 2001
Cited by 111 (16 self)
Multiword expressions are a key problem for the development of large-scale, linguistically sound natural language processing technology. This paper surveys...
Estimation of Stochastic Attribute-Value Grammars using an Informative Sample
, 2000
Cited by 8 (1 self)
We argue that some of the computational complexity associated with estimation of stochastic attribute-value grammars can be reduced by training upon an informative subset of the full training set. Results using the parsed Wall Street Journal corpus show that in some circumstances, it is possible to obtain better estimation results using an informative sample than when training upon all the available material. Further experimentation demonstrates that with unlexicalised models, a Gaussian prior can reduce overfitting. However, when models are lexicalised and contain overlapping features, overfitting does not seem to be a problem, and a Gaussian prior makes minimal difference to performance. Our approach is applicable for situations when there are an infeasibly large number of parses in the training set, or else for when recovery of these parses from a packed representation is itself computationally expensive.
MDL-based DCG Induction for NP Identification
, 1999
Cited by 6 (3 self)
We introduce a learner capable of automatically extending large, manually written natural language Definite Clause Grammars with missing syntactic rules. It is based upon the Minimum Description Length principle, and can be trained upon either just raw text, or else raw text additionally annotated with parsed corpora. As a demonstration of the learner, we show how full Noun Phrases (NPs that might contain pre- or post-modifying phrases and might also be recursively nested) can be identified in raw text. Preliminary results obtained by varying the amount of syntactic information in the training set suggest that raw text is less useful than additional NP bracketing information. However, using all syntactic information in the training set does not produce a significant improvement over just bracketing information.
DCG Induction using MDL and Parsed Corpora
- Learning Language in Logic, pages 63–71, Bled, Slovenia
, 1999
Cited by 3 (2 self)
We show how partial models of natural language syntax (manually written DCGs, with parameters estimated from a parsed corpus) can be automatically extended when trained upon raw text (using MDL). We also show how we can use a parsed corpus as an alternative constraint upon learning. Empirical evaluation suggests that a parsed corpus is more informative than an MDL-based prior. However, best results are achieved when the learner is supervised with a compression-based prior and a parsed corpus.
Improving Dependency Parsing with Semantic Classes
Cited by 1 (0 self)
This paper presents the introduction of WordNet semantic classes in a dependency parser, obtaining improvements on the full Penn Treebank for the first time. We tried different combinations of some basic semantic classes and word sense disambiguation algorithms. Our experiments show that selecting the adequate combination of semantic features on development data is key for success. Given the basic nature of the semantic classes and word sense disambiguation algorithms used, we think there is ample room for future improvements.

