Web-based models for natural language processing
- ACM Transactions on Speech and Language Processing, 2005
Cited by 48 (0 self)
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
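As a rough illustration of how Web counts and corpus counts might be combined, here is a minimal sketch assuming plain linear interpolation and a simple "corpus first, Web otherwise" backoff. The function names, the weight `lam`, and all counts are hypothetical; the abstract does not say which formulation the article actually uses.

```python
def interpolate(web_count, web_total, corpus_count, corpus_total, lam=0.5):
    """Linearly interpolate Web-based and corpus-based relative frequencies.

    lam is a hypothetical mixing weight in [0, 1]; lam=1 uses only the Web.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    p_web = web_count / web_total if web_total else 0.0
    p_corpus = corpus_count / corpus_total if corpus_total else 0.0
    return lam * p_web + (1.0 - lam) * p_corpus


def backoff(web_count, web_total, corpus_count, corpus_total):
    """Use the corpus estimate if the n-gram was seen there, else the Web one."""
    if corpus_count > 0:
        return corpus_count / corpus_total
    return web_count / web_total if web_total else 0.0
```

With this sketch, an n-gram that never occurs in the corpus still receives a non-zero estimate as long as the Web count is non-zero.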
Robust Large-Scale EBMT with Marker-Based Segmentation
- In Proceedings of the Tenth Conference on Theoretical and Methodological Issues in Machine Translation (TMI-04), 2004
Cited by 26 (13 self)
Previous work on marker-based EBMT [Gough & Way, 2003; Way & Gough, 2004] suffered from problems such as data-sparseness and disparity between the training and test data. We have developed a large-scale robust EBMT system. In a comparison with the systems listed in [Somers, 2003], ours is the third largest EBMT system and certainly the largest English-French EBMT system. Previous work used the on-line MT system Logomedia to translate source language material as a means of populating the system’s database where bitexts were unavailable. We derive our sententially aligned strings from a Sun Translation Memory (TM) and limit the integration of Logomedia to the derivation of our word-level lexicon. We also use Logomedia to provide a baseline comparison for our system and observe that we outperform Logomedia and previous marker-based EBMT systems in a number of tests.
Data-Oriented Models of Parsing and Translation
- 2005
Cited by 12 (2 self)
A dissertation submitted in fulfilment of the requirements for the award of
Controlled Generation in Example-Based Machine Translation
- In Proceedings of the Ninth Machine Translation Summit (MT Summit IX), 2003
Cited by 9 (3 self)
The theme of controlled translation is currently in vogue in the area of MT. Recent research (Schäler et al., 2003; Carl, 2003) hypothesises that EBMT systems are perhaps best suited to this challenging task. In this paper, we present an EBMT system where the generation of the target string is filtered by data written according to controlled language specifications. As far as we are aware, this is the only research available on this topic. In the field of controlled language applications, it is more usual to constrain the source language in this way rather than the target. We translate a small corpus of controlled English into French using the on-line MT system Logomedia, and seed the memories of our EBMT system with a set of automatically induced lexical resources using the Marker Hypothesis as a segmentation tool. We test our system on a large set of sentences extracted from a Sun Translation Memory, and provide both an automatic and a human evaluation. For comparative purposes, we also provide results for Logomedia itself.
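Segmentation with the Marker Hypothesis can be pictured roughly as follows: a sentence is cut wherever a closed-class "marker" word (determiner, preposition, conjunction, and so on) begins a new chunk. The sketch below is only an illustration under stated assumptions. The marker set is a hypothetical, deliberately tiny sample, and the rule that every chunk must contain at least one non-marker word is an assumption, not a detail taken from the abstract.

```python
# Hypothetical, deliberately tiny set of English marker (closed-class) words.
MARKERS = {"the", "a", "an", "in", "on", "of", "to", "with", "and", "or", "but"}


def marker_segment(sentence):
    """Split a sentence into chunks that each begin at a marker word.

    A new chunk opens at a marker word only when the current chunk already
    holds a non-marker word; a trailing marker-only chunk is merged back
    into the previous chunk.
    """
    chunks, current, has_content = [], [], False
    for token in sentence.split():
        is_marker = token.lower() in MARKERS
        if is_marker and has_content:
            chunks.append(current)
            current, has_content = [], False
        current.append(token)
        if not is_marker:
            has_content = True
    if current:
        if has_content or not chunks:
            chunks.append(current)
        else:
            chunks[-1].extend(current)  # merge trailing marker-only chunk
    return [" ".join(chunk) for chunk in chunks]
```

For example, `marker_segment("the man walked to the shop in the rain")` yields the three chunks "the man walked", "to the shop", and "in the rain".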
A study of using search engine page hits as a proxy for n-gram frequencies
- In Proceedings of RANLP’05, 2005
Cited by 7 (1 self)
The idea of using the Web as a corpus for linguistic research is becoming increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significant for the task examined.
TransBooster: Boosting the Performance of Wide-coverage Machine Translation Systems
- Proceedings of the Tenth EAMT Workshop, 2005
Cited by 5 (3 self)
We propose the design, implementation and evaluation of a novel and modular approach to boost the translation performance of existing, wide-coverage, freely available machine translation systems based on reliable and fast automatic decomposition of the translation input and corresponding composition of translation output. We provide details of our method, and experimental results compared to the MT systems SYSTRAN and Logomedia. While many avenues for further experimentation remain, to date we fall just behind the baseline systems on the full 800-sentence test set, but in certain cases our method causes the translation quality obtained via the MT systems to improve.
Orthographic errors in web pages: Towards cleaner web corpora
- Computational Linguistics, 2006
Cited by 5 (2 self)
Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
Harvesting the Bitexts of the Laws of Hong Kong From the Web
Cited by 2 (1 self)
In this paper we present our recent work on harvesting English-Chinese bitexts of the laws of Hong Kong from the Web and aligning them to the subparagraph level by using the numbering system in the legal text hierarchy. Basic methodology and practical techniques are reported in detail. The resultant bilingual corpus, 10.4M English words and 18.3M Chinese characters, is an authoritative and comprehensive text collection covering the specific and special domain of HK laws. It is particularly valuable to empirical MT research. This piece of work has also laid a foundation for exploring and harvesting English-Chinese bitexts in a larger volume from the Web.
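Alignment through a shared numbering system can be sketched as a join on the numbering labels. The example below is a minimal illustration, not the authors' method: the label pattern (for instance `3(1)(a)`), the function names, and the sample strings in the usage note are all hypothetical.

```python
import re

# Hypothetical label pattern: a section number followed by optional
# parenthesised levels, e.g. "3(1)(a) <text>".
LABEL = re.compile(r"^\s*(\d+[A-Z]*(?:\([0-9a-zA-Z]+\))*)\s+(.*)$")


def index_by_label(lines):
    """Map each numbering label to the text that follows it."""
    indexed = {}
    for line in lines:
        match = LABEL.match(line)
        if match:
            indexed[match.group(1)] = match.group(2).strip()
    return indexed


def align(english_lines, chinese_lines):
    """Pair segments that share a label, in English document order."""
    en = index_by_label(english_lines)
    zh = index_by_label(chinese_lines)
    return [(label, en[label], zh[label]) for label in en if label in zh]
```

Calling `align(["3(1) EN text", "4 EN other"], ["3(1) ZH text"])` pairs only the segment labelled `3(1)`, because label `4` has no counterpart on the Chinese side.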
Ninth Machine Translation Summit
- 2003
Machine translation and computer-aided translation have become key technologies in the present-day globalised communications scene. These are truly cross-disciplinary technologies which should not be used without a certain level of understanding. As a result, many universities and academic institutions teach courses on MT and related language technologies, both at graduate and at undergraduate levels. Courses may be aimed on the one hand at translation and linguistics majors,

