Results 1 - 8 of 8
Building a Large Annotated Corpus of English: The Penn Treebank
- COMPUTATIONAL LINGUISTICS
, 1993
Cited by 1654 (9 self)
There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable for enterprises as diverse as the automatic construction of statistical models for the grammar of the written and the colloquial spoken language, the development of explicit formal theories of the differing grammars of writing and speech, the investigation of prosodic phenomena in speech, and the evaluation and comparison of the adequacy of parsing models.
In this paper, we review our experience with constructing one such large annotated corpus--the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989-1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure. These materials are available to members of the Linguistic Data Consortium; for details, see Section 5.1.
Evaluating Parsing Strategies Using Standardized Parse Files
- In Proceedings of the 3rd ACL Conference on Applied Natural Language Processing
, 1992
Cited by 20 (1 self)
The availability of large files of manually reviewed parse trees from the University of Pennsylvania "tree bank", along with a program for comparing system-generated parses against these "standard" parses, provides a new opportunity for evaluating different parsing strategies. We discuss some of the restructuring required to the output of our parser so that it could be meaningfully compared with these standard parses. We then describe several heuristics for improving parsing accuracy and coverage, such as closest attachment of modifiers, statistical grammars, and fitted parses, and present a quantitative evaluation of the improvements obtained with each strategy.
The Penn Treebank: An Overview
, 2003
Cited by 18 (0 self)
The Penn Treebank, in its eight years of operation (1989-1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank: POS tagging, syntactic bracketing, and disfluency annotation, and the methodology employed in production. All available from http://www.ldc.upenn.edu.
Detecting Dependencies between Semantic Verb Subclasses and Subcategorization Frames in Text Corpora
, 1993
Cited by 7 (1 self)
We present a method for individuating dependencies between the semantic class of predicates and their associated subcategorization frames, and describe an implementation which allows the acquisition of such dependencies from bracketed texts.
An Analysis of English Punctuation: The Special Case of Comma
- John Benjamins Publishing Company. Amsterdam (The Netherlands)
, 1998
Cited by 6 (0 self)
This paper contains the details of a computer-aided exercise to investigate English punctuation practice for the special case of comma (the most significant punctuation mark) in a parsed corpus. The study classifies the various `structural' uses of comma according to the syntax patterns in which comma occurs. The corpus (Penn Treebank) consists of syntactically annotated sentences with no part-of-speech tag information about the individual words.
A Corpus-based Probabilistic Unification Grammar
- Proceedings Esslli Workshop
Cited by 2 (0 self)
This paper describes an attempt at creating a robust parsing system for the SCHISMA task domain. We describe how a probabilistic unification grammar is generated from a corpus of utterances collected in Wizard of Oz experiments. We tagged the corpus with syntactic categories and superficial syntactic structure using Standard Generalised Markup Language (SGML). From the annotated data thus obtained a probabilistic unification grammar was generated. The grammar was then tested on `seen' and `unseen' data from the same domain using a probabilistic left-corner parser for PATR-II unification grammars. We will evaluate the quality and size of the corpus from a syntactic point of view, describe the grammar we obtained, and report on the performance of the parsing system when applied to unseen data from the same domain.
ADVANCES IN AUTOMATIC TERMINOLOGY PROCESSING: METHODOLOGY AND APPLICATION IN FOCUS
, 2007
This work or any part thereof has not previously been presented in any form to the University or to any other institutional body whether for assessment, publication, or for other purposes. Save for any express acknowledgements, references and/or bibliographies cited in the work, I confirm that the intellectual content of the work is the result of my own efforts and of no other person. The right of Le An Ha to be identified as author of this work is asserted in accordance with ss.77 and 78 of the Copyright, Designs and Patents Act 1988. At this date, copyright is owned by the author.
The information and knowledge era, in which we are living, creates challenges in many fields, and terminology is not an exception. The challenges include an exponential growth in the number of specialised documents that are available, in which terms are presented, and the number of newly introduced concepts and terms, which are already beyond our (manual) capacity. A promising solution to this ‘information overload’ would be to employ automatic or semi-automatic procedures
Linguistics With Enriching Statistics: Performance Models Of Natural Language
- University of Amsterdam
, 1995
© 1995 by Rens Bod. All rights reserved. Printed in the Netherlands by Academische Pers, Amsterdam.
Acknowledgements
This thesis benefitted from discussions with many people. I would like to express my thanks to Martin van den Berg, Kenneth Church, Marc Dymetman, Bipin Indurkhya, Laszlo Kalman, Ronald Kaplan, Martin Kay, Steven Krauwer, Kwee Tjoe Liong, Neza van der Leeuw, David Magerman, Arie Mijnlieff, Fernando Pereira, Philip Resnik, Yves Schabes, Khalil Sima'an and Frederik Somsen. Furthermore, I wish to thank the members of the graduation committee: Renate Bartsch, Jan van Eijck, Gerard Kempen, Chris Klaassen and Anton Nijholt. I am grateful to Steven Krauwer for allowing me to work at this thesis while I was involved in the CLASK project ("Combining Linguistic and Statistical Knowledge") at Utrecht University. The fruitful discussions and positive cooperation with my colleagues Martin van den Berg and Khalil Sima'an have been of incalculable value

