Results 1 - 10
of
93
cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models
- In Proceedings of ACL System Demonstrations
, 2010
"... We present cdec, an open source framework for decoding, aligning with, and training a number of statistical machine translation models, including word-based models, phrase-based models, and models based on synchronous context-free grammars. Using a single unified internal representation for translat ..."
Abstract
-
Cited by 134 (53 self)
- Add to MetaCart
(Show Context)
We present cdec, an open source framework for decoding, aligning with, and training a number of statistical machine translation models, including word-based models, phrase-based models, and models based on synchronous context-free grammars. Using a single unified internal representation for translation forests, the decoder strictly separates model-specific translation logic from general rescoring, pruning, and inference algorithms. From this unified representation, the decoder can extract not only the 1- or k-best translations, but also alignments to a reference, or the quantities necessary to drive discriminative training using gradient-based or gradient-free optimization techniques. Its efficient C++ implementation means that memory use and runtime performance are significantly better than comparable decoders. 1
Generating phrasal and sentential paraphrases: A survey of data-driven methods
- Computational Linguistics
, 2010
"... The task of paraphrasing is inherently familiar to speakers of all languages. Moreover, the task of automatically generating or extracting semantic equivalences for the various units of language— words, phrases, and sentences—is an important part of natural language processing (NLP) and is being inc ..."
Abstract
-
Cited by 74 (5 self)
- Add to MetaCart
The task of paraphrasing is inherently familiar to speakers of all languages. Moreover, the task of automatically generating or extracting semantic equivalences for the various units of language— words, phrases, and sentences—is an important part of natural language processing (NLP) and is being increasingly employed to improve the performance of several NLP applications. In this article, we attempt to conduct a comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods, while also conveying an appreciation for the importance and potential use of paraphrases in the field of NLP research. Recent work done in manual and automatic construction of paraphrase corpora is also examined. We also discuss the strategies used for evaluating paraphrase generation techniques and briefly explore some future trends in paraphrase generation. 1.
Hierarchical phrase-based translation with suffix arrays
- In Proc. of EMNLP-CoNLL
, 2007
"... A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this prob-lem can be addressed by storing the training data in memory and using a suffix array as an efficient index to quickly lo ..."
Abstract
-
Cited by 56 (5 self)
- Add to MetaCart
A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this prob-lem can be addressed by storing the training data in memory and using a suffix array as an efficient index to quickly lookup and ex-tract rules on the fly. Hierarchical phrase-based translation introduces the added wrin-kle of source phrases with gaps. Lookup algorithms used for contiguous phrases no longer apply and the best approximate pat-tern matching algorithms are much too slow, taking several minutes per sentence. We describe new lookup algorithms for hierar-chical phrase-based translation that reduce the empirical computation time by nearly two orders of magnitude, making on-the-fly lookup feasible for source phrases with gaps. 1
Following directions using statistical machine translation
- In Proceeding of the 5th ACM/IEEE international conference on Human-robot interaction, 251–258. ACM
, 2010
"... Abstract—Mobile robots that interact with humans in an intuitive way must be able to follow directions provided by humans in unconstrained natural language. In this work we investigate how statistical machine translation techniques can be used to bridge the gap between natural language route instruc ..."
Abstract
-
Cited by 51 (4 self)
- Add to MetaCart
(Show Context)
Abstract—Mobile robots that interact with humans in an intuitive way must be able to follow directions provided by humans in unconstrained natural language. In this work we investigate how statistical machine translation techniques can be used to bridge the gap between natural language route instructions and a map of an environment built by a robot. Our approach uses training data to learn to translate from natural language instructions to an automatically-labeled map. The complexity of the translation process is controlled by taking advantage of physical constraints imposed by the map. As a result, our technique can efficiently handle uncertainty in both map labeling and parsing. Our experiments demonstrate the promising capabilities achieved by our approach. Index Terms—Human-robot interaction; instruction following; navigation; statistical machine translation; natural language I.
Using Paraphrases for Parameter Tuning in Statistical Machine Translation
"... Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their ..."
Abstract
-
Cited by 41 (10 self)
- Add to MetaCart
(Show Context)
Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their corresponding reference translations. However, obtaining reference translations is expensive. In this paper, we introduce a new full-sentence paraphrase technique, based on English-to-English decoding with an MT system, and we demonstrate that the resulting paraphrases can be used to drastically reduce the number of human reference translations needed for parameter tuning, without a significant decrease in translation quality. 1
A Systematic Analysis of Translation Model Search Spaces
"... Translation systems are complex, and most metrics do little to pinpoint causes of error or isolate system differences. We use a simple technique to discover induction errors, which occur when good translations are absent from model search spaces. Our results show that a common pruning heuristic dras ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
Translation systems are complex, and most metrics do little to pinpoint causes of error or isolate system differences. We use a simple technique to discover induction errors, which occur when good translations are absent from model search spaces. Our results show that a common pruning heuristic drastically increases induction error, and also strongly suggest that the search spaces of phrase-based and hierarchical phrase-based models are highly overlapping despite the well known structural differences. 1
A class-based agreement model for generating accurately inflected translations
- In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL ’12
, 2012
"... When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agree ..."
Abstract
-
Cited by 16 (2 self)
- Add to MetaCart
(Show Context)
When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders. 1
Automatic Generation of Bid Phrases for Online Advertising
"... One of the most prevalent online advertising methods is textual advertising. To produce a textual ad, an advertiser must craft a short creative (the text of the ad) linking to a landing page, which describes the product or service being promoted. Furthermore, the advertiser must associate the creati ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
(Show Context)
One of the most prevalent online advertising methods is textual advertising. To produce a textual ad, an advertiser must craft a short creative (the text of the ad) linking to a landing page, which describes the product or service being promoted. Furthermore, the advertiser must associate the creative to a set of manually chosen bid phrases representing those Web search queries that should trigger the ad. For efficiency, given a landing page, the bid phrases are often chosen first, and then for each bid phrase the creative is produced using a template. Nevertheless, an ad campaign (e.g., for a large retailer) might involve thousands of landing pages and tens or hundreds of thousands of bid phrases, hence the entire process is very laborious. Our study aims towards the automatic construction of online ad campaigns: given a landing page, we propose several algorithmic methods to generate bid phrases suitable for the given input. Such phrases must be both relevant (that is, reflect the content of the page) and well-formed (that is, likely to be used as queries to a Web search engine). To this end, we use a two phase approach. First, candidate bid phrases are generated by a number of methods, including a (monolingual) translation model capable of generating phrases not contained within the text of the input as well as previously “unseen ” phrases. Second, the candidates are ranked in a probabilistic framework using both the translation model, which favors relevant phrases, as well as a bid phrase language model, which favors well-formed phrases. Empirical evaluation based on a real-life corpus of advertisercreated landing pages and associated bid phrases confirms the value of our approach, which successfully re-generates many of the human-crafted bid phrases and performs significantly better than a pure text extraction method. The research described herein was conducted while the first author was a summer intern at Yahoo! Research.
Translation as weighted deduction
- In Proc. of EACL
, 2009
"... We present a unified view of many translation algorithms that synthesizes work on deductive parsing, semiring parsing, and efficient approximate search algorithms. This gives rise to clean analyses and compact descriptions that can serve as the basis for modular implementations. We illustrate this w ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
(Show Context)
We present a unified view of many translation algorithms that synthesizes work on deductive parsing, semiring parsing, and efficient approximate search algorithms. This gives rise to clean analyses and compact descriptions that can serve as the basis for modular implementations. We illustrate this with several examples, showing how to build search spaces for several disparate phrase-based search strategies, integrate non-local features, and devise novel models. Although the framework is drawn from parsing and applied to translation, it is applicable to many dynamic programming problems arising in natural language processing and other areas. 1