Aligning A Parallel English-Chinese Corpus Statistically With Lexical Criteria (1994)

by Dekai Wu

Results 1 - 10 of 48

Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora

by Dekai Wu , 1997
Cited by 343 (20 self)

A Polynomial-Time Algorithm for Statistical Machine Translation

by Dekai Wu - In 34th Annual Meeting of the Association for Computational Linguistics , 1996
Cited by 68 (6 self)
We introduce a polynomial-time algorithm for statistical machine translation. This algorithm can be used in place of the expensive, slow best-first search strategies in current statistical translation architectures.

Improving machine translation performance by exploiting non-parallel corpora

by Dragos Stefan Munteanu, Daniel Marcu - Computational Linguistics , 2005
Cited by 56 (2 self)
We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.

A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora

by Pascale Fung - In Proceedings of the 33rd Annual Conference of the Association for Computational Linguistics , 1995
Cited by 54 (5 self)
We present a pattern matching method for compiling a bilingual lexicon of nouns and proper nouns from unaligned, noisy parallel texts of Asian/Indo-European language pairs. Tagging information of one language is used. Word frequency and position information for high and low frequency words are represented in two different vector forms for pattern matching. New anchor point finding and noise elimination techniques are introduced. We obtained 73.1% precision. We also show how the results can be used in the compilation of domain-specific noun phrases.

A Geometric Approach to Mapping Bitext Correspondence

by I. Dan Melamed , 1996
Cited by 48 (13 self)
NLP work is to construct a detailed map of the correspondence between a text and its translation. Several automatic methods for this task have been proposed in recent years. Yet even the best of these methods can err by several typeset pages. The Smooth Injective Map Recognizer (SIMR) is a new bitext mapping algorithm. SIMR's errors are smaller than those of the previous front-runner by more than a factor of 4. Its robustness has enabled new commercial-quality applications. The greedy nature of the algorithm makes it independent of memory resources. Unlike other bitext mapping algorithms, SIMR allows crossing correspondences to account for word order differences. Its output can be converted quickly and easily into a sentence alignment. SIMR's output has been used to align more than 200 megabytes of the Canadian Hansards for publication by the Linguistic Data Consortium.

Fast and Accurate Sentence Alignment of Bilingual Corpora

by Robert C. Moore - In Stephen D , 2002
Cited by 41 (1 self)
We present a new method for aligning sentences with their translations in a parallel bilingual corpus. Previous approaches have generally been based either on sentence length or word correspondences. Sentence-length-based methods are relatively fast and fairly accurate. Word-correspondence-based methods are generally more accurate but much slower, and usually depend on cognates or a bilingual lexicon. Our method adapts and combines these approaches, achieving high accuracy at a modest computational cost, and requiring no knowledge of the languages or the corpus beyond division into words and sentences.

Aligning noisy parallel corpora across language groups: Word pair feature matching by dynamic time warping

by Pascale Fung, Kathleen McKeown - In Proceedings of the First Conference of the Association for Machine Translation in the Americas, 81--88 , 1994
Cited by 35 (4 self)
We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment.
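
The matching step named in the abstract above, Dynamic Time Warping over per-word feature sequences, can be illustrated with a minimal sketch. This is not the DK-vec implementation: the function names `recency_vector` and `dtw_distance`, the absolute-difference local cost, and the toy positions are all assumptions made for illustration. It only shows how two sequences of different lengths can be compared by warping.

```python
from typing import List, Sequence


def recency_vector(positions: Sequence[int]) -> List[int]:
    """Gaps between successive occurrences of a word.

    Hypothetical feature for illustration: positions are offsets
    at which one word occurs in one text.
    """
    return [q - p for p, q in zip(positions, positions[1:])]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook dynamic time warping cost between two numeric sequences.

    Assumes both sequences are non-empty; the local cost is the
    absolute difference (an assumption, not taken from the paper).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest warping cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


if __name__ == "__main__":
    # Toy occurrence positions for a candidate word pair (made-up numbers).
    source_positions = [2, 10, 15, 40]
    target_positions = [3, 12, 18, 20, 45]
    print(dtw_distance(recency_vector(source_positions),
                       recency_vector(target_positions)))
```

Under these assumptions, a lower cost suggests that the two words occur in similar rhythms across their texts, which is the kind of evidence that could nominate a word pair as an anchor point.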

Finding Terminology Translations From Non-Parallel Corpora

by Pascale Fung, Kathleen McKeown , 1997
Cited by 34 (3 self)
In this paper, we present an initial algorithm for translating technical terms using a pair of non-parallel corpora. Evaluation results show translation precision of around 30% when only the top candidate is considered. While this precision is lower than that achieved with parallel corpora, we show that the top-20 candidate output from our algorithm allows translators to increase their accuracy by 50.9%. In the following sections, we first describe a pair of non-parallel corpora we use for experiments, and then we introduce the Word Relation Matrix (WoRM), a statistical word feature representation for technical term translation from non-parallel corpora. We evaluate the effectiveness of this feature with two sets of experiments, using English/English and English/Japanese non-parallel corpora.

Improving Chinese tokenization with linguistic filters on statistical lexical acquisition

by Dekai Wu - In Proceedings of the Fourth Conference on Applied Natural Language Processing , 1994
Cited by 33 (13 self)
The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993;

Grammarless Extraction of Phrasal Translation Examples from Parallel Texts

by Dekai Wu - In Proceedings of the Sixth International Conference on Theoretical and Methodological Issues in Machine Translation , 1995
Cited by 31 (7 self)
We describe a method for identifying subsentential phrasal translation examples in sentence-aligned parallel corpora, using only a probabilistic translation lexicon for the language pair. Our method differs from previous approaches in that (1) it is founded on a formal basis, making use of an inversion transduction grammar (ITG) formalism that we recently developed for bilingual language modeling, and (2) it requires no language-specific monolingual grammars for the source and target languages. Instead, we devise a generic, language-independent constituent-matching ITG with inherent expressiveness properties that correspond to a desirable level of matching flexibility. Bilingual parsing, in conjunction with a stochastic version of the ITG formalism, performs the phrasal translation extraction. (The Hong Kong University of Science & Technology, Department of Computer Science, Technical Report Series.)