Web-based models for natural language processing (2005)
| Venue: | ACM Transactions on Speech and Language Processing |
| Citations: | 48 - 0 self |
BibTeX
@ARTICLE{Lapata05web-basedmodels,
author = {Mirella Lapata and Frank Keller},
title = {Web-based models for natural language processing},
journal = {ACM Transactions on Speech and Language Processing},
year = {2005},
volume = {2},
pages = {1--31}
}
Abstract
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
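
The abstract mentions combining Web counts and corpus counts through backoff or interpolation but does not give the formulas. The following is a minimal sketch of what such a combination could look like, not the authors' implementation. The function names, the weight `lam`, and the cutoff `min_count` are hypothetical, and the inputs are assumed to be relative frequencies already normalized within each source.

```python
def interpolate(web_freq, corpus_freq, lam=0.5):
    """Linearly interpolate two relative frequencies.

    lam is a hypothetical mixing weight in [0, 1]; lam=1.0 uses only the
    Web estimate, lam=0.0 uses only the corpus estimate.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * web_freq + (1.0 - lam) * corpus_freq


def backoff(corpus_freq, web_freq, corpus_count, min_count=1):
    """Prefer the corpus estimate; back off to the Web estimate when the
    n-gram was seen fewer than min_count times in the corpus.

    min_count is an assumed cutoff, not a value taken from the article.
    """
    if corpus_count >= min_count:
        return corpus_freq
    return web_freq


if __name__ == "__main__":
    # An n-gram unseen in the corpus falls back to its Web frequency.
    print(backoff(corpus_freq=0.0, web_freq=2e-7, corpus_count=0))
    # An equal-weight mixture of the two estimates.
    print(interpolate(web_freq=2e-7, corpus_freq=4e-7, lam=0.5))
```

In practice a weight such as `lam` would be tuned on held-out data for each task. The sketch fixes it at 0.5 only for illustration.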







