Soft Syntactic Constraints for Hierarchical Phrased-Based Translation
Cited by 21 (2 self)
In adding syntax to statistical MT, there is a tradeoff between taking advantage of linguistic analysis and allowing the model to exploit linguistically unmotivated mappings learned from parallel training data. A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment. We present an approach that explores the tradeoff from the other direction, starting with a context-free translation model learned directly from aligned parallel text, and then adding soft constituent-level constraints based on parses of the source language. We obtain substantial improvements in performance for translation from Chinese and Arabic to English.
Building Minority Language Corpora by Learning to Generate Web Search Queries
Knowledge and Information Systems, 2000
Cited by 18 (2 self)
The Web is an obvious source of valuable information, but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents matching a minority concept. As the minority concept, we use text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or non-relevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term-scoring methods. We find that using odds-ratio scores calculated over the documents acquired so far is one of the most consistently accurate query-generation methods. We also parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented, showing that our approach generalizes well across several languages regardless of the initial conditions.
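As a rough illustration of the odds-ratio term scoring and Gamma-distributed query length mentioned in the abstract above, the sketch below scores terms by a smoothed log odds ratio over documents labeled relevant or non-relevant. It is not CorpusBuilder's implementation: the function names, the additive smoothing constant, the document-frequency estimate, and the toy token lists are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Smoothed log odds ratio per term (hypothetical helper, not from the paper).

    Each document is a list of tokens. Probabilities are estimated from
    document frequencies with additive smoothing (an assumption).
    """
    rel_df = Counter(t for doc in relevant_docs for t in set(doc))
    non_df = Counter(t for doc in nonrelevant_docs for t in set(doc))
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for term in set(rel_df) | set(non_df):
        p = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        q = (non_df[term] + smoothing) / (n_non + 2 * smoothing)
        scores[term] = math.log(p * (1 - q) / ((1 - p) * q))
    return scores


def sample_query_length(shape, scale, rng):
    """Draw a query length from a Gamma distribution, rounded, at least 1."""
    return max(1, round(rng.gammavariate(shape, scale)))


def build_query(scores, n_include, n_exclude):
    """Pick the highest-scoring terms to include and the lowest to exclude."""
    include = sorted(scores, key=lambda t: (-scores[t], t))[:n_include]
    ascending = sorted(scores, key=lambda t: (scores[t], t))
    exclude = [t for t in ascending if t not in include][:n_exclude]
    return include, exclude


# Toy, made-up documents: "relevant" means the language filter accepted them.
relevant = [["sl", "je", "in"], ["je", "in", "the"]]
nonrelevant = [["the", "and"], ["the", "of", "in"]]

scores = odds_ratio_scores(relevant, nonrelevant)
include, exclude = build_query(scores, n_include=1, n_exclude=3)
length = sample_query_length(shape=2.0, scale=1.5, rng=random.Random(0))
```

In this toy setup, the term that appears in every accepted document and in no rejected one receives the highest score and becomes the inclusion term, while terms concentrated in rejected documents become exclusion terms.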
Mining the Web to Create Minority Language Corpora
2001
Cited by 17 (0 self)
The Web is a valuable source of language-specific resources, but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far is one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages are also presented, showing that our approach generalizes to a variety of languages.
High-Performance Bilingual Text Alignment Using Statistical And Dictionary Information
In Proceedings of Annual Conference of the Association for Computational Linguistics, 1996
Cited by 15 (1 self)
This paper describes an accurate and robust text alignment system for structurally different languages. Among structurally different languages such as Japanese and English, there is a limitation on the amount of word correspondences that can be statistically acquired. The proposed method makes use of two kinds of word correspondences in aligning bilingual texts. One is a bilingual dictionary of general use. The other is the word correspondences that are statistically acquired in the alignment process. Our method gradually determines sentence pairs (anchors) that correspond to each other by relaxing parameters. By combining the two kinds of word correspondences, the method achieves adequate word correspondences for complete alignment. As a result, texts of various lengths and genres in structurally different languages can be aligned with high precision. Experimental results show that our system outperforms conventional methods for various kinds of Japanese-English texts.
Finite-state-based and phrase-based statistical machine translation
Proc. of the 8th Int. Conf. on Spoken Language Processing, ICSLP’04, 2004
Cited by 14 (11 self)
This paper shows the common framework that underlies translation systems based on phrases or driven by finite-state transducers, and summarizes a first comparison between them. In both approaches the translation process is based on pairs of source and target strings of words (segments) related by word alignment. Their main difference comes from the statistical modeling of the translation context. The experimental study has been carried out on an English/Spanish version of the VERBMOBIL corpus. Under the constraint of a monotone composition of translated segments to generate the target sentence, the finite-state-based translation outperforms its phrase-based counterpart.
MITRE’s Submission to the EU PASCAL RTE Challenge
In PASCAL. Proc. of the First Challenge Workshop. Recognizing Textual Entailment, 2005
Cited by 14 (0 self)
We describe MITRE’s two submissions to the RTE Challenge, intended to exemplify two different ends of the spectrum of possibilities. The first submission is a traditional system based on linguistic analysis and inference, while the second is inspired by alignment approaches from machine translation. We also describe our efforts to build our own entailment corpus. Finally, we discuss our investigations and reflections on the strengths and weaknesses of the evaluation itself.
Lexicalist Machine Translation of Spatial Prepositions
1995
Cited by 13 (1 self)
This thesis proposes a strongly lexicalist approach to machine translation and applies it to the translation of spatial prepositions and prepositional expressions between English and Spanish. Bilingual contrastive knowledge resides solely in the bilingual lexicon and is structured in the form of correspondences between sets of source and target language lexemes related through indices. The resulting architecture maximizes the independence of the monolingual and bilingual components. This independence is demonstrated by developing a grammar of Spanish which is significantly different in its constructions from its analogous English grammar. In particular, relative clauses are analysed through a single rule that allows gaps in subject position, while clitic climbing and doubling are handled through mechanisms not normally found in grammatical descriptions of English. Bilingual lexical rules, in conjunction with the bilingual lexicon, constitute a single, motivated and well defined mechani...
Distributed Latent Variable Models of Lexical Co-occurrences
In Proceedings of the International Workshop on Artificial Intelligence and Statistics, 2005
Cited by 10 (0 self)
Low-dimensional representations for lexical co-occurrence data have become increasingly important in alleviating the sparse data problem inherent in natural language processing tasks. This work presents a distributed latent variable model for inducing these low-dimensional representations. The model takes
Statistical machine reordering
In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 2006
Cited by 9 (3 self)
Reordering is currently one of the most important problems in statistical machine translation systems. This paper presents a novel strategy for dealing with it: statistical machine reordering (SMR). It consists of using the powerful techniques developed for statistical machine translation (SMT) to translate the source language (S) into a reordered source language (S’), which allows for an improved translation into the target language (T). The SMT task changes from S2T to S’2T, which leads to a monotonized word alignment and shorter translation units. In addition, the use of classes in SMR helps to infer new word reorderings. Experiments are reported on the EsEn WMT06 tasks and the ZhEn IWSLT05 task and show significant improvement in translation quality.
Online Learning Methods For Discriminative Training of Phrase Based Statistical Machine Translation
Cited by 9 (1 self)
This paper investigates the task of discriminatively training a phrase-based SMT system with millions of features using the structured perceptron and the Margin Infused Relaxed Algorithm (MIRA), two popular online learning algorithms. We also compare two different update strategies: one where we update towards an oracle translation candidate extracted from an N-best list, and a more aggressive approach in which we update towards an oracle extracted prior to training using a minloss decoder. We evaluate our different training algorithms on the Czech-English translation task. Our results show that while both learning algorithms achieve similar results, with the perceptron converging more rapidly, the aggressive update strategy performs significantly worse than the more conservative strategy, corroborating the findings of Liang et al. (2006).
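The conservative strategy described above, updating towards an oracle candidate taken from the N-best list, can be sketched as a single structured perceptron step. This is a minimal sketch under stated assumptions, not the paper's system: the feature names, the per-candidate quality score (standing in for something like sentence-level BLEU), the learning rate, and the function name `perceptron_update` are all hypothetical.

```python
def perceptron_update(weights, nbest, lr=1.0):
    """One structured perceptron step over an N-best list (illustrative only).

    weights: dict mapping feature name -> weight, modified in place.
    nbest:   list of (features, quality) pairs, where features is a dict of
             feature values and quality is an assumed reference-based score.
    Returns True if the weights changed.
    """
    def model_score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    # Candidate the current model prefers.
    predicted = max(nbest, key=lambda cand: model_score(cand[0]))
    # Oracle: the best candidate in the list according to the quality score.
    oracle = max(nbest, key=lambda cand: cand[1])

    if predicted is oracle:
        return False  # model already ranks the oracle first; no update

    # Move the weights towards the oracle's features and away from the
    # features of the wrongly preferred candidate.
    for name, value in oracle[0].items():
        weights[name] = weights.get(name, 0.0) + lr * value
    for name, value in predicted[0].items():
        weights[name] = weights.get(name, 0.0) - lr * value
    return True


# Toy N-best list with made-up features and quality scores.
nbest = [
    ({"lm": 1.0, "len": 2.0}, 0.2),
    ({"lm": 0.5, "len": 1.0, "tm": 1.0}, 0.9),
]
weights = {}
first_changed = perceptron_update(weights, nbest)
second_changed = perceptron_update(weights, nbest)
```

After the first step the model ranks the oracle candidate first on this toy list, so the second call leaves the weights untouched. MIRA differs in that it scales the update by a margin-dependent step size, which this sketch does not attempt.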

