Scaling to Very Very Large Corpora for Natural Language Disambiguation (2001) [49 citations — 2 self]

by Michele Banko, Eric Brill

Abstract:

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.

Citations

273 Unsupervised word sense disambiguation rivaling supervised methods – Yarowsky - 1995
175 A method for disambiguating word senses in a large corpus – Gale, Church, et al. - 1993
156 Tagging english text with a probabilistic model – Merialdo - 1994
123 Learning to classify text from labeled and unlabeled documents – Nigam, McCallum, et al. - 1998
101 Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French – Yarowsky - 1994
73 Committee-based sampling for training probabilistic classifiers – Dagan, Engelson - 1995
69 Classifier combination for improved lexical disambiguation, COLING/ACL – Brill, Wu - 1998
50 A Winnow-Based Approach to Context-Sensitive Spelling Correction – Golding, Roth - 1999
50 Improving Data Driven Wordclass Tagging by System Combination – Halteren, Zavrel, et al. - 1998
44 Automatic rule acquisition for spelling correction – Mangu, Brill - 1997
42 The role of unlabeled data in supervised learning – Mitchell - 1999
41 Exploiting diversity in natural language processing: Combining parsers – Henderson, Brill - 1999
37 Combining Trigram-based and feature-based methods for context-sensitive spelling correction, ACL – Golding, Schabes - 1996
36 A Bayesian hybrid method for context-sensitive spelling correction – Golding - 1995
23 A Simple Approach to Building Ensembles of Naïve Bayesian Classifiers for Word Sense Disambiguation – Pedersen - 2000
15 Contextual spelling correction using Latent Semantic Analysis – Jones, Martin - 1997
13 Efficient lattice representation and generation – Weng, Stolcke, et al. - 1998
7 Mitigating the Paucity of Data Problem – Banko, Brill - 2001
1 Mitigating the Paucity of Data Problem. Human Language Technology – Banko, Brill - 2001
1 Heterogeneous uncertainty sampling – Lewis, Catlett - 1994
1 A sequential algorithm for training text classifiers – Lewis, Gale - 1994