Domain Specific Knowledge Acquisition for Conceptual Sentence Analysis (1994)

by Claire Cardie

Results 1 - 10 of 25

MBT: A Memory-Based Part of Speech Tagger-Generator

by Walter Daelemans, Jakub Zavrel, Peter Berck, Steven Gillis - Proc. of Fourth Workshop on Very Large Corpora, 1996
Cited by 168 (47 self)
We introduce a memory-based approach to part of speech tagging. Memory-based learning is a form of supervised learning based on similarity-based reasoning. The part of speech tag of a word in a particular context is extrapolated from the most similar cases held in memory. Supervised learning approaches are useful when a tagged corpus is available as an example of the desired output of the tagger. Based on such a corpus, the tagger-generator automatically builds a tagger which is able to tag new text the same way, considerably reducing the development time needed to construct a tagger. Memory-based tagging shares this advantage with other statistical or machine learning approaches. Additional advantages specific to a memory-based approach include (i) the relatively small tagged corpus size sufficient for training, (ii) incremental learning, (iii) explanation capabilities, (iv) flexible integration of information in case representations, (v) its non-parametric nature, (vi) reasonably good results on unknown words without morphological analysis, and (vii) fast learning and tagging. In this paper we show that a large-scale application of the memory-based approach is feasible: we obtain a tagging accuracy that is on a par with that of known statistical approaches, and with attractive space and time complexity properties when using IGTree, a tree-based formalism for indexing and searching huge case bases. The use of IGTree has the additional advantage that the optimal context size for disambiguation is computed dynamically.
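
As a rough illustration of the idea described in this abstract, the sketch below tags a word by retrieving the most similar stored cases under a plain overlap metric and taking a majority vote. The feature layout (previous tag, focus word, next word), the tiny case base, and the function names are assumptions made for illustration; this is not the MBT implementation and it does not use IGTree.

```python
from collections import Counter

# Hypothetical case base: (previous tag, focus word, next word) -> tag.
CASES = [
    (("DET", "run", "was"), "NOUN"),
    (("PRON", "run", "every"), "VERB"),
    (("DET", "dog", "barked"), "NOUN"),
    (("PRON", "walk", "home"), "VERB"),
]

def overlap(a, b):
    """Count the feature positions on which two cases agree."""
    return sum(x == y for x, y in zip(a, b))

def tag(features, cases=CASES):
    """Return the majority tag among the most similar stored cases."""
    best = max(overlap(features, stored) for stored, _ in cases)
    votes = Counter(t for stored, t in cases if overlap(features, stored) == best)
    return votes.most_common(1)[0][0]
```

Under these assumptions, `tag(("DET", "run", "fast"))` matches the first stored case on two of three features and therefore inherits its tag. Storing a new case is just an append to `CASES`, which is one way to picture the incremental learning the abstract mentions.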

Forgetting Exceptions is Harmful in Language Learning

by Walter Daelemans, Antal van den Bosch, Jakub Zavrel - Machine Learning, Special Issue on Natural Language Learning, 1999
Cited by 94 (38 self)
We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.

Memory-Based Shallow Parsing

by Walter Daelemans, Sabine Buchholz, Jorn Veenstra - In Proceedings of CoNLL, 1999
Cited by 66 (13 self)
We present a memory-based learning (MBL) approach to shallow parsing in which POS tagging, chunking, and identification of syntactic relations are formulated as memory-based modules. The experiments reported in this paper show competitive results: the Fβ=1 score for the Wall Street Journal (WSJ) treebank is 93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection and 79.0% for object detection.

Automating Feature Set Selection for Case-Based Learning of Linguistic Knowledge

by Claire Cardie, 1996
Cited by 29 (0 self)
This paper addresses the issue of "algorithm vs. representation" for case-based learning of linguistic knowledge. We first present empirical evidence that the success of case-based learning methods for natural language processing tasks depends to a large degree on the feature set used to describe the training instances. Next, we present a technique for automating feature set selection for case-based learning of linguistic knowledge. Given as input a baseline case representation, the method modifies the representation in response to a number of predefined linguistic biases by adding, deleting, and weighting features appropriately. We apply the linguistic bias approach to feature set selection to the problem of relative pronoun disambiguation and show that the case-based learning algorithm improves as relevant biases are incorporated into the underlying instance representation. Finally, we argue that the linguistic bias approach to feature set selection offers new possibilities for case-based learning of natural language: it simplifies the process of instance representation design and, in theory, obviates the need for separate instance representations for each linguistic knowledge acquisition task. More importantly, the approach offers a mechanism for explicitly combining the frequency information available from corpus-based techniques with linguistic bias information employed in traditional linguistic and knowledge-based approaches to natural language processing.

An Environment for Morphosyntactic Processing of Unrestricted Spanish Text

by J. Carmona, S. Cervell, L. Màrquez, M. A. Martí, L. Padró, R. Placer, H. Rodríguez, M. Taulé, J. Turmo, 1998
Cited by 24 (6 self)
We present in this paper a fast, broad-coverage, accurate morphological analyzer for Spanish words, MACO+, which is an extended and improved version of that described in (Acebo et al., 1994). The earlier version had two main flaws: it was not transportable, and it was too slow to enable massive text processing. The presented system not only overcomes those two flaws, but also offers improved coverage and accuracy. We also present two general part-of-speech taggers, which can be used to disambiguate the output of the morphological analyzer. All modules run on any Unix/Linux machine as a pipeline process and they may also be used inside the GATE environment for NLP (Cunningham et al., 1996). The system is currently being used to annotate the LexEsp corpus, a 5.5 million word corpus of Spanish, in a bootstrapping refining procedure. Initial evaluation and results are reported. Keywords: Morphological analysis, corpus linguistics, POS tagging, linguistic resources.

Memory-Based Learning: Using Similarity for Smoothing

by Jakub Zavrel, Walter Daelemans, 1997
Cited by 23 (7 self)
This paper analyses the relation between the use of similarity in Memory-Based Learning and the notion of backed-off smoothing in statistical language modeling. We show that the two approaches are closely related, and we argue that feature weighting methods in the Memory-Based paradigm can offer the advantage of automatically specifying a suitable domain-specific hierarchy between most specific and most general conditioning information without the need for a large number of parameters. We report two applications of this approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art performance in both domains, and allows the easy integration of diverse information sources, such as rich lexical representations.
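
To make the notion of feature weighting in this abstract more concrete, the sketch below weights each feature by its information gain and uses the weights in an overlap similarity. The choice of information gain is an assumption (the abstract does not name a specific weighting), and the toy PP-attachment cases and function names are hypothetical; this is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(cases, index):
    """Reduction in label entropy obtained by splitting on one feature."""
    labels = [label for _, label in cases]
    by_value = defaultdict(list)
    for features, label in cases:
        by_value[features[index]].append(label)
    remainder = sum(len(subset) / len(cases) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

def weighted_overlap(a, b, weights):
    """Sum the weights of the feature positions on which two cases agree."""
    return sum(w for x, y, w in zip(a, b, weights) if x == y)

# Hypothetical cases: (verb, preposition) -> attachment site (V or N).
CASES = [
    (("eat", "with"), "V"),
    (("eat", "of"), "N"),
    (("see", "with"), "V"),
    (("see", "of"), "N"),
]
WEIGHTS = [information_gain(CASES, i) for i in range(2)]
```

In this toy data the preposition fully determines the attachment and the verb carries no information, so a mismatch on the verb costs nothing in similarity. That is loosely comparable to backing off from the full context to the preposition alone, although here the ordering comes from the weights rather than from a hand-specified back-off sequence.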

Careful Abstraction from Instance Families in Memory-Based Language Learning

by Antal van den Bosch - Journal of Experimental and Theoretical Artificial Intelligence, 1999
Cited by 12 (6 self)
Empirical studies in inductive language learning point at pure memory-based learning as a successful approach to many language learning tasks, often performing better than learning methods that abstract from the learning material. The possibility is left open, however, that limited, careful abstraction in memory-based learning may be harmless to generalisation, as long as the disjunctivity of language data is preserved. We compare three types of careful abstraction: editing, oblivious (partial) decision-tree abstra...

Embedded Machine Learning Systems for Natural Language Processing: A General Framework

by Claire Cardie - Lecture Notes in Artificial Intelligence Series, 1996
Cited by 11 (2 self)
This paper presents Kenmore, a general framework for knowledge acquisition for natural language processing (NLP) systems. To ease the acquisition of knowledge in new domains, Kenmore exploits an online corpus using robust sentence analysis and embedded symbolic machine learning techniques while requiring only minimal human intervention. By treating all problems in ambiguity resolution as classification tasks, the framework uniformly addresses a range of subproblems in sentence analysis, each of which traditionally had required a separate computational mechanism. In a series of experiments, we demonstrate the successful use of Kenmore for learning solutions to several problems in lexical and structural ambiguity resolution. We argue that the learning and knowledge acquisition components should be embedded components of the NLP system in that (1) learning should take place within the larger natural language understanding system as it processes text, and (2) the learning components shou...

Resolving Part-of-Speech Ambiguity in the Greek Language Using Learning Techniques

by Georgios Petasis, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos - In Proceedings of the ECCAI Advanced Course on Artificial Intelligence (ACAI), 1999
Cited by 7 (1 self)
This article investigates the use of Transformation-Based Error-Driven learning for resolving part-of-speech ambiguity in the Greek language. The aim is not only to study the performance, but also to examine its dependence on different thematic domains. Results are presented here for two different test cases: a corpus on “management succession events” and a general-theme corpus. The two experiments show that the performance of this method does not depend on the thematic domain of the corpus, and its accuracy for the Greek language is around 95%.

Instance-Family Abstraction in Memory-Based Language Learning

by Antal van den Bosch - Machine Learning: Proceedings of the Sixteenth International Conference, 1999
Cited by 7 (3 self)
Memory-based learning appears relatively successful when the learning data is highly disjunct, i.e., when classes are scattered over many small families of instances in instance space, as in many language learning tasks. Abstraction over borders of disjuncts tends to harm generalization performance. However, careful abstraction in memory-based learning may be harmless when it preserves the disjunctivity of the learning data. We investigate the effect of careful abstraction in a series of language-learning task studies, and a small benchmark-task study. We find that when combined with feature weighting or value-distance metrics, careful abstraction, as implemented in the new fambl algorithm, can equal the generalization accuracies of pure memory-based learning, while attaining fair levels of memory compression.