An Introduction to Machine Translation
, 1992
Cited by 276 (7 self)
In the last ten years there has been a significant amount of research in Machine Translation within a “new” paradigm of empirical approaches, often labelled collectively as “Example-based” approaches. The first manifestation of this approach caused some surprise and hostility among observers more used to different ways of working, but the techniques were quickly adopted and adapted by many researchers, often creating hybrid systems. This paper reviews the various research efforts within this paradigm reported to date, and attempts a categorisation of different manifestations of the general approach.
Using the Web to Obtain Frequencies for Unseen Bigrams
- Computational Linguistics
, 2003
Cited by 104 (2 self)
This article shows that the Web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the Web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between Web frequencies and corpus frequencies; (b) a reliable correlation between Web frequencies and plausibility judgments; (c) a reliable correlation between Web frequencies and frequencies recreated using class-based smoothing; (d) a good performance of Web frequencies in a pseudodisambiguation task.
Web-based models for natural language processing
- ACM Transactions on Speech and Language Processing
, 2005
Cited by 48 (0 self)
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
The web as a baseline: Evaluating the performance of unsupervised web-based models for a range of nlp tasks
- In Proc. of Human Language Technologies - North American Chapter of the Association for Computational Linguistics (HLT-NAACL)
, 2004
Cited by 31 (1 self)
Previous work demonstrated that web counts can be used to approximate bigram frequencies, and thus should be useful for a wide variety of NLP tasks. So far, only two generation tasks (candidate selection for machine translation and confusion-set disambiguation) have been tested using web-scale data sets. The present paper investigates if these results generalize to tasks covering both syntax and semantics, both generation and analysis, and a larger range of n-grams. For the majority of tasks, we find that simple, unsupervised models perform better when n-gram frequencies are obtained from the web rather than from a large corpus. However, in most cases, web-based models fail to outperform more sophisticated state-of-the-art models trained on small corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard models.
Using the Web to Overcome Data Sparseness
- In Proceedings of EMNLP-02
, 2002
Cited by 25 (1 self)
This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating that web frequencies correlate with frequencies obtained from a carefully edited, balanced corpus.
New Tools for Web-Scale N-grams
Cited by 18 (10 self)
While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally-similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. The tools will allow novel sources of information to be applied to long-standing natural language challenges.
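To make the notion of an N-gram corpus concrete, the following is a minimal sketch of counting how often each word sequence of length 1 to N occurs. It is not the paper's tooling: the function name `ngram_counts`, the whitespace tokenisation, and the toy sentence are all assumptions chosen for illustration, and no part-of-speech annotation is modelled.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous token sequence of length 1..max_n.

    Hypothetical helper for illustration only; a real web-scale N-gram
    corpus would be built offline and would usually drop rare sequences.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy input, split on whitespace (an assumption, not a real tokeniser).
tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens, 2)
print(counts[("the",)], counts[("the", "cat")])
```

Looking up a tuple of words in `counts` then answers the question "how often does this sequence occur" without rereading the underlying text, which is the sense in which such a corpus compresses it.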
wEBMT: Developing and Validating an Example-Based Machine Translation System using the World Wide Web
- COMPUTATIONAL LINGUISTICS
, 2003
Bilingual Terminology Mining – Using Brain, not brawn comparable corpora
Cited by 13 (3 self)
Current research in text mining favours the quantity of texts over their quality. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the quality rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the quality of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. We show how important the type of discourse is as a characteristic of the comparable corpus.
Concordancing the Web with KWiCFinder
, 2001
Cited by 6 (1 self)
This paper outlines tools and techniques to exploit the Web as a vast linguistic and cultural corpus. After a snapshot of the dimensions and diversity of the Web in late 2001, this paper surveys the capabilities and shortcomings of major web search engines and characterizes typical user search behavior. Next it describes KWiCFinder, a free search tool optimized for language professionals programmed by the author to make online research more efficient and effective. KWiCFinder automatically conducts a search, retrieves online documents matching the user's query, and produces a keyword in context concordance of those documents for rapid evaluation. Then the author details many of the considerations which went into designing KWiCFinder and the challenges of maintaining it. Next the usefulness of the Web as a source of linguistic data is discussed and illustrated with a number of examples. The article closes with a discussion of future directions for web concordancing.
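As an illustration of what a keyword-in-context concordance looks like, here is a minimal sketch that covers only the concordancing step on text already in memory. It does not reproduce KWiCFinder's search or document retrieval; the function name `kwic`, the `width` parameter, the punctuation handling, and the sample text are assumptions made for this example.

```python
def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each hit.

    Hypothetical helper: matching is case-insensitive and ignores
    punctuation attached to a token; context is measured in tokens.
    """
    tokens = text.split()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Invented sample text standing in for a retrieved document.
sample = "The Web is a vast corpus. Linguists query the Web for rare usages."
rows = kwic(sample, "web", width=2)
for left, match, right in rows:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>12} [{match}] {right}")
```

Aligning the keyword in a column is what lets a reader scan many hits quickly, which is the rapid evaluation the abstract refers to.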
Interpretation of compound nominalisations using corpus and web statistics
- In ACL Workshop on Multiword Expressions
, 2006
Cited by 6 (0 self)
We present two novel paraphrase tests for automatically predicting the inherent semantic relation of a given compound nominalisation as one of subject, direct object, or prepositional object. We compare these to the usual verb–argument paraphrase test using corpus statistics, and frequencies obtained by scraping the Google search engine interface. We also implemented a more robust statistical measure than maximum likelihood estimation: the confidence interval. A significant reduction in data sparseness was achieved, but this alone is insufficient to provide a substantial performance improvement.

