Building a Large Annotated Corpus of English: The Penn Treebank
- COMPUTATIONAL LINGUISTICS
, 1993
Abstract
-
Cited by 1654 (9 self)
There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable for enterprises as diverse as the automatic construction of statistical models for the grammar of the written and the colloquial spoken language, the development of explicit formal theories of the differing grammars of writing and speech, the investigation of prosodic phenomena in speech, and the evaluation and comparison of the adequacy of parsing models.
In this paper, we review our experience with constructing one such large annotated corpus--the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989-1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure. These materials are available to members of the Linguistic Data Consortium; for details, see Section 5.1.
Noun Classification From Predicate-Argument Structures
, 1990
Abstract
-
Cited by 198 (0 self)
A method of determining the similarity of nouns on the basis of a metric derived from the distribution of subject, verb and object in a large text corpus is described. The resulting quasi-semantic classification of nouns demonstrates the plausibility of the distributional hypothesis, and has potential application to a variety of tasks, including automatic indexing, resolving nominal compounds, and determining the scope of modification.
Part-of-Speech Tagging and Partial Parsing
- Corpus-Based Methods in Language and Speech
, 1996
Abstract
-
Cited by 85 (0 self)
m we can carve off next. `Partial parsing' is a cover term for a range of different techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but non-zero error rate.
1 Tagging
The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Evaluation Techniques for Automatic Semantic Extraction: Comparing Syntactic and Window Based Approaches
, 1993
Abstract
-
Cited by 42 (0 self)
As large on-line corpora become more prevalent, a number of attempts have been made to automatically extract thesaurus-like relations directly from text using knowledge-poor methods. In the absence of any specific application, comparing the results of these attempts is difficult. Here we propose an evaluation method using gold standards, i.e., pre-existing hand-compiled resources, as a means of comparing extraction techniques. Using this evaluation method, we compare two semantic extraction techniques which produce similar word lists, one using syntactic context of words, and the other using windows of heuristically tagged words. The two techniques are very similar except that in one case selective natural language processing, a partial syntactic analysis, is performed. On a 4 megabyte corpus, syntactic contexts produce significantly better results against the gold standards for the most characteristic words in the corpus, while windows produce better results for rare words.
The Penn Treebank: An Overview
, 2003
Abstract
-
Cited by 18 (0 self)
The Penn Treebank, in its eight years of operation (1989-1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available http://www.ldc.upenn.edu.
Corpus-Based Acquisition of Relative Pronoun Disambiguation Heuristics
, 1992
Abstract
-
Cited by 15 (7 self)
This paper presents a corpus-based approach for deriving heuristics to locate the antecedents of relative pronouns. The technique duplicates the performance of hand-coded rules and requires human intervention only during the training phase. Because the training instances are built on parser output rather than word cooccurrences, the technique requires a small number of training examples and can be used on small to medium-sized corpora. Our initial results suggest that the approach may provide a general method for the automated acquisition of a variety of disambiguation heuristics for natural language systems, especially for problems that require the assimilation of syntactic and semantic knowledge.
A Syntax-Based Part-of-Speech Analyser
- IN EACL-95
, 1995
Abstract
-
Cited by 10 (0 self)
There are two main methodologies for constructing the knowledge base of a natural language analyser: the linguistic and the data-driven. Recent state-of-the-art part-of-speech taggers are based on the data-driven approach. Because of the known feasibility of the linguistic rule-based approach at related levels of description, the success of the data-driven approach in part-of-speech analysis may appear surprising. In this paper, a case is made for the syntactic nature of part-of-speech tagging. A new tagger of English that uses only linguistic distributional rules is outlined and empirically evaluated. Tested against a benchmark corpus of 38,000 words of previously unseen text, this syntax-based system reaches an accuracy of above 99%. Compared to the 95-97% accuracy of its best competitors, this result suggests the feasibility of the linguistic approach also in part-of-speech analysis.
Rapid Grammar Development and Parsing: Constraint Dependency Grammars with Abstract Role Values
, 2000
Abstract
-
Cited by 6 (1 self)
A thesis submitted to the faculty of Purdue University by Christopher M. White in partial fulfillment of the requirements for the degree of Doctor of Philosophy, May 2000.

