Abstract:
The amount of readily available online text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested, and compared after training on corpora of one million words or fewer. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that, for this particular application, correctly labeled training data is free. Since this will often not be the case, we also examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
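One way to read the claim that labeled data is free for confusion set disambiguation is sketched below. This is an illustration under stated assumptions, not the authors' implementation: it assumes the source text is well edited, so the word the writer actually used can be taken as the correct label. The confusion set {then, than}, the function name `extract_examples`, and the two-word context window are all hypothetical choices made for this example.

```python
import re

# Hypothetical confusion set, chosen for illustration only.
CONFUSION_SET = ("then", "than")


def extract_examples(text, confusion_set=CONFUSION_SET, window=2):
    """Turn raw text into (context, label) training pairs.

    Each occurrence of a confusion-set word yields one example: the word
    that appears in the text is used as the label (assuming the text is
    well edited), and up to `window` words on each side form the context.
    The target position is masked with "_".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    examples = []
    for i, tok in enumerate(tokens):
        if tok in confusion_set:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            examples.append((left + ["_"] + right, tok))
    return examples


examples = extract_examples("It was bigger than a house, and then it grew.")
for context, label in examples:
    print(" ".join(context), "->", label)
```

Because no human annotation is involved, the number of examples obtained this way grows with the amount of raw text available, which is what makes very large training sets practical for this task.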