Results 1 - 10 of 41
Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics
2004
Cited by 53 (1 self)
In this paper we describe two new objective automatic evaluation methods for machine translation. The first method is based on the longest common subsequence between a candidate translation and a set of reference translations. Longest common subsequence takes sentence-level structure similarity into account naturally and identifies the longest co-occurring in-sequence n-grams automatically. The second method relaxes strict n-gram matching to skip-bigram matching. A skip-bigram is any pair of words in their sentence order. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. The empirical results show that both methods correlate very well with human judgments of both adequacy and fluency.
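To make the two statistics in this abstract concrete, here is a minimal Python sketch. It assumes whitespace tokenization, a single reference rather than a set, and an F-measure with a `beta` weight to combine precision and recall; the function names and these simplifications are illustrative assumptions, not the paper's exact definitions.

```python
from collections import Counter
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def skip_bigrams(tokens):
    """Every pair of words in sentence order (i < j), counted with multiplicity."""
    return Counter(combinations(tokens, 2))


def f_measure(match, cand_total, ref_total, beta=1.0):
    """Combine precision (match/cand_total) and recall (match/ref_total)."""
    if match == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    p = match / cand_total
    r = match / ref_total
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)


def lcs_score(candidate, reference, beta=1.0):
    c, r = candidate.split(), reference.split()
    return f_measure(lcs_length(c, r), len(c), len(r), beta)


def skip_bigram_score(candidate, reference, beta=1.0):
    c = skip_bigrams(candidate.split())
    r = skip_bigrams(reference.split())
    match = sum((c & r).values())  # multiset intersection of skip-bigrams
    return f_measure(match, sum(c.values()), sum(r.values()), beta)


reference = "police killed the gunman"
print(lcs_score("police kill the gunman", reference))          # 0.75
print(skip_bigram_score("police kill the gunman", reference))  # 0.5
print(lcs_score("the gunman police killed", reference))        # 0.5
```

In this toy example the scrambled candidate "the gunman police killed" shares all four words with the reference but scores lower on both measures, because each one rewards words that appear in the same order.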
Computing Consensus Translation from Multiple Machine Translation Systems Using Enhanced Hypotheses Alignment
Cambridge University Engineering Department, 2006
Cited by 26 (5 self)
This paper describes a novel method for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The outputs are combined, and a possibly new translation hypothesis can be generated. Similar to the well-established ROVER approach of Fiscus (1997) for combining speech recognition hypotheses, the consensus translation is computed by voting on a confusion network. To create the confusion network, we produce pairwise word alignments of the original machine translation hypotheses with an enhanced statistical alignment algorithm that explicitly models word reordering. The context of a whole document of translations, rather than a single sentence, is taken into account to produce the alignment.
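The voting step can be illustrated with a small Python sketch. It assumes the hard part, the word alignment, has already been done and that the confusion network is given as a list of columns, one word per system, with an empty string marking an empty arc. The `EPS` marker, the optional per-system weights, and the tie-breaking rule are hypothetical choices for illustration, not the paper's method.

```python
from collections import defaultdict

EPS = ""  # hypothetical marker for "this system has no word in this slot"


def vote(columns, weights=None):
    """Pick the highest-weighted word in each column and drop empty arcs.

    columns: list of columns; each column holds one word per system.
    weights: optional per-system weights (defaults to one vote each).
    Ties go to the word contributed by the earliest system.
    """
    consensus = []
    for column in columns:
        w = weights if weights is not None else [1.0] * len(column)
        totals = defaultdict(float)
        for word, weight in zip(column, w):
            totals[word] += weight
        best = max(totals, key=totals.get)  # first-seen word wins ties
        if best != EPS:
            consensus.append(best)
    return consensus


# Three toy system outputs, already aligned column by column:
#   system 1: the car red
#   system 2: the automobile is red
#   system 3: a car is crimson
columns = [
    ["the", "the", "a"],
    ["car", "automobile", "car"],
    [EPS, "is", "is"],
    ["red", "red", "crimson"],
]
print(" ".join(vote(columns)))  # the car is red
```

The consensus "the car is red" matches none of the three inputs, which shows how voting on the network can produce a new hypothesis.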
Using machine translation evaluation techniques to determine sentence-level semantic equivalence
In IWP2005, 2005
Cited by 10 (0 self)
The task of machine translation (MT) evaluation is closely related to the task of sentence-level semantic equivalence classification. This paper investigates the utility of applying standard MT evaluation methods (BLEU, NIST, WER and PER) to building classifiers to predict semantic equivalence and entailment. We also introduce a novel classification method based on PER which leverages part-of-speech information of the words contributing to the word matches and non-matches in the sentence. Our results show that MT evaluation techniques are able to produce useful features for paraphrase classification and, to a lesser extent, entailment. Our technique gives a substantial improvement in paraphrase classification accuracy over all of the other models used in the experiments.
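For readers unfamiliar with the two error rates named above, the following Python sketch computes WER as word-level edit distance divided by reference length, and PER with one common bag-of-words formulation, (max(|ref|, |hyp|) - matches) / |ref|. Definitions of PER vary across papers, so this formula and the function names are assumptions for illustration, not necessarily the variant the authors used.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: order-sensitive."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: compares bags of words, ignoring order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


print(wer("he is here now", "is he here now"))  # 0.5
print(per("he is here now", "is he here now"))  # 0.0
```

The swapped word pair costs two edits under WER but nothing under PER, which is why the two rates can serve as complementary features for a sentence-pair classifier.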
On the integration of speech recognition and statistical machine translation
Proc. European Conf. on Speech Communication and Technology, 2005
Cited by 9 (0 self)
This paper focuses on the interface between speech recognition and machine translation in a speech translation system. Based on a thorough theoretical framework, we exploit word lattices of automatic speech recognition hypotheses as input to our translation system, which is based on weighted finite-state transducers. We show that acoustic recognition scores of the recognized words in the lattices positively and significantly affect the translation quality. In experiments, we have found consistent improvements on three different corpora in comparison with translations of single best recognized results. In addition, we build and evaluate a fully integrated speech translation model.
ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation
2004
Cited by 7 (0 self)
Comparisons of automatic evaluation metrics for machine translation are usually conducted at the corpus level using correlation statistics such as Pearson’s product moment correlation coefficient or Spearman’s rank order correlation coefficient between human scores and automatic scores. However, such comparisons rely on human judgments of translation qualities such as adequacy and fluency. Unfortunately, these judgments are often inconsistent and very expensive to acquire. In this paper, we introduce a new evaluation method, ORANGE, for evaluating automatic machine translation evaluation metrics automatically without extra human involvement other than using a set of reference translations. We also show the results of comparing several existing automatic metrics and three new automatic metrics using ORANGE.
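The conventional corpus-level comparison that this abstract contrasts with ORANGE can be sketched in a few lines of Python. The code below computes Pearson's and Spearman's coefficients between human scores and automatic scores; it does not implement ORANGE itself. The score lists are made-up illustrative numbers, and the helper names are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in order[i:j + 1]:
            ranks[k] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank order correlation: Pearson computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five MT systems (illustrative values only).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.10, 0.20, 0.30, 0.40, 0.90]
print(pearson(human_scores, metric_scores))
print(spearman(human_scores, metric_scores))  # 1.0
```

Here the metric orders the systems exactly as the humans do, so the rank correlation is perfect, while Pearson's coefficient is lower because the relationship is not linear.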
Finding the System that Suits you Best: Towards the Normalization of MT Evaluation
In ASLIB (27th International Conference on Translating and the Computer), 2005
Cited by 5 (4 self)
disparate metrics and methods which have been devised for MT and helps evaluators to design an evaluation plan based on the context of use intended for the system. FEMTI therefore allows the generation of more standardized and reusable evaluation plans. By evaluators we mean not only developers and programmers, but also end users, managers, and anyone else with a stake in the acquisition or deployment of a system. Thus, the use of FEMTI is not limited to experts in the field of MT. In this paper we describe FEMTI and the latest enhancements we are making to it, in particular the interfaces which allow evaluators not only to create their own tailor-made evaluation plans, but also to contribute their experience and expertise in constantly improving the resource for the community at large.
Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
In Proc. ICASSP, 2005
Cited by 5 (0 self)
This paper addresses a critical problem in deploying a spoken dialog system (SDS). One of the main bottlenecks of SDS deployment for a new domain is data sparseness in building a statistical language model. Our goal is to devise a method to efficiently build a reliable language model for a new SDS. We consider the worst yet quite common scenario where only a small amount (∼1.7K utterances) of domain-specific data is available for the target domain. We present a new method that exploits external static text resources that are collected for other speech recognition tasks as well as dynamic text resources acquired from the World Wide Web (WWW). We show that language models built using external resources can be used jointly with a limited in-domain (baseline) language model to obtain significant improvements in speech recognition accuracy. Combining language models built using external resources with the in-domain language model provides over 20% reduction in WER over the baseline in-domain language model. Equivalently, we achieve almost the same level of performance as would be obtained with ten times as much in-domain data (17K utterances).
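One common way to combine an in-domain language model with one built from external text is linear interpolation. The abstract does not state which combination scheme the authors use, so the Python sketch below is an assumption: it interpolates two add-one-smoothed unigram models over a shared vocabulary, and shows how a relative WER reduction is calculated. All sentences, the weight `lam`, and the WER values are made up for illustration.

```python
from collections import Counter


def unigram_model(sentences, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for s in sentences for w in s.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_in, p_ext, lam):
    """Linear interpolation: lam * in-domain + (1 - lam) * external."""
    return {w: lam * p_in[w] + (1 - lam) * p_ext[w] for w in p_in}


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction with respect to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy data: a tiny in-domain set and a larger external set (hypothetical).
in_domain = ["book a flight", "book a hotel"]
external = ["a flight to boston", "a hotel in boston", "a cheap flight"]
vocab = {w for s in in_domain + external for w in s.split()}

p_in = unigram_model(in_domain, vocab)
p_ext = unigram_model(external, vocab)
p_mix = interpolate(p_in, p_ext, lam=0.6)

print(round(sum(p_mix.values()), 6))               # 1.0
print(round(relative_reduction(0.30, 0.24), 6))    # 0.2
```

Because both component models sum to one over the same vocabulary, any weight between 0 and 1 yields a valid distribution; in practice the weight would typically be tuned on held-out in-domain data.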
Scaling the ISLE framework: Use of existing corpus resources for validation of MT evaluation metrics across languages
In Proceedings of LREC 2002, Las Palmas, Canary Islands, 2002
Cited by 4 (1 self)
This paper describes a machine translation (MT) evaluation (MTE) research program which has benefited from the availability of two collections of source language texts and the results of processing these texts with several commercial MT engines (DARPA 1994, Doyon, Taylor, & White 1999). The methodology entails the systematic development of a predictive relationship between discrete, well-defined MTE metrics and specific information processing tasks that can be reliably performed with output of a given MT system. Unlike tests used in initial experiments on automated scoring (Jones and Rusk 2000), we employ traditional measures of MT output quality, selected from the International Standards for Language Engineering (ISLE) framework: Coherence, Clarity, Syntax, Morphology, General and Domain-specific Lexical robustness, to include Named-entity translation. Each test was originally validated on MT output produced by three Spanish-to-English systems (1994 DARPA MTE). We validate tests in the present work, however, with material taken from the MT Scale Evaluation research program produced by Japanese-to-English MT systems. Since Spanish and Japanese differ structurally on the morphological, syntactic, and discourse levels, a comparison of scores on tests measuring these output qualities should reveal how structural similarity, such as that enjoyed by Spanish and English, and structural contrast, such as that found between Japanese and English, affect the linguistic distinctions which must be accommodated by MT systems. Moreover, we show that metrics developed using Spanish-English MT output are equally effective when applied to Japanese-English MT output.
Extrinsic Evaluation of Automatic Metrics for Summarization
2004
Cited by 4 (1 self)
This paper describes extrinsic-task evaluation of summarization. We show that it is possible to save time using summaries for relevance assessment without adversely impacting the degree of accuracy that would be possible with full documents. In addition, we demonstrate that the extrinsic task we have selected exhibits a high degree of interannotator agreement, i.e., consistent relevance decisions across subjects. We also conducted a composite experiment that better reflects the actual document selection process and found that using a surrogate improves the processing speed over reading the entire document. Finally, we have found a small yet statistically significant correlation between some of the intrinsic measures and a user's performance in an extrinsic task. The overall conclusion we can draw at this point is that ROUGE-1 does correlate with precision and to a somewhat lesser degree with accuracy, but that it remains to be investigated how stable these correlations are and how differences in ROUGE-1 translate into significant differences in human performance in an extrinsic task.
Biology Based Alignments of Paraphrases for Sentence Compression
Cited by 3 (3 self)
In this paper, we present a study for extracting and aligning paraphrases in the context of Sentence Compression. First, we justify the application of a new measure for the automatic extraction of paraphrase corpora. Second, we discuss the work done by (Barzilay & Lee, 2003) who use clustering of paraphrases to induce rewriting rules. We will see, through classical visualization methodologies (Kruskal & Wish, 1977) and exhaustive experiments, that clustering may not be the best approach for automatic pattern identification. Finally, we will provide some results of different biology based methodologies for pairwise paraphrase alignment.

