Unsupervised segmentation for statistical machine translation (2003)

by S Sereewattana

Results 1 - 2 of 2

Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner

by Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, Markus Sadeniemi - Proc. of MT Summit XI, 2007
Abstract - Cited by 10 (2 self)
In this paper, we apply a method of unsupervised morphology learning to a state-of-the-art phrase-based statistical machine translation (SMT) system. In SMT, words are traditionally used as the smallest units of translation. Such a system generalizes poorly to word forms that do not occur in the training data. In particular, this is problematic for languages that are highly compounding, highly inflecting, or both. An alternative is to use sub-word units, such as morphemes. We use the Morfessor algorithm to find statistical morpheme-like units (called morphs) that can be used to reduce the size of the lexicon and improve the ability to generalize. Translation and language models are trained directly on morphs instead of words. The approach is tested on three Nordic languages (Danish, Finnish, and Swedish) that are included in the Europarl corpus consisting of the Proceedings of the European Parliament. However, in our experiments we did not obtain higher BLEU scores for the morph model than for the standard word-based approach. Nonetheless, the proposed morph-based solution has clear benefits, as morphologically well-motivated structures (phrases) are learned, and the proportion of words left untranslated is clearly reduced.
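
The lexicon-size argument in the abstract above can be illustrated with a toy sketch. This is not the Morfessor algorithm and not the authors' pipeline: the segmentation table is hand-written and hypothetical, and the names `TOY_SEGMENTATIONS`, `segment`, and `to_morphs` are assumptions made only for this illustration. It shows how replacing words with morphs can shrink the lexicon and cover a word form that never appeared in the training data.

```python
from typing import Dict, List

# Hypothetical, hand-written segmentations of a few Finnish word forms.
# In the paper these would come from an unsupervised learner; here they
# are simply assumed for illustration.
TOY_SEGMENTATIONS: Dict[str, List[str]] = {
    "talo": ["talo"],            # "house"
    "talossa": ["talo", "ssa"],  # "in the house"
    "talosta": ["talo", "sta"],  # "from the house"
    "autossa": ["auto", "ssa"],  # "in the car"
    "autosta": ["auto", "sta"],  # "from the car"
}


def segment(word: str, table: Dict[str, List[str]]) -> List[str]:
    """Return the morphs for a word, or the word itself if it is unknown."""
    return table.get(word, [word])


def to_morphs(words: List[str], table: Dict[str, List[str]]) -> List[str]:
    """Flatten a list of words into a list of morphs."""
    morphs: List[str] = []
    for word in words:
        morphs.extend(segment(word, table))
    return morphs


# Toy "training data": every word form seen during training.
training_words = ["talo", "talossa", "talosta", "autossa", "autosta"]

word_lexicon = set(training_words)
morph_lexicon = set(to_morphs(training_words, TOY_SEGMENTATIONS))

print(len(word_lexicon))   # 5 distinct word forms
print(len(morph_lexicon))  # 4 distinct morphs

# "auto" never occurs as a word in the training data, but its only morph
# was seen inside "autossa" and "autosta".
unseen = "auto"
print(unseen in word_lexicon)
print(all(m in morph_lexicon for m in segment(unseen, TOY_SEGMENTATIONS)))
```

On real corpora the reduction depends entirely on the quality of the learned segmentation; the five-word example only makes the mechanism visible.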

Translating Between Closely Related Languages in Statistical Machine Translation

by Bryce Miller
Abstract
Minor languages are gaining more and more status these days, but there is still little parallel data for minority languages that can be used by a normal SMT system. If we are to translate these languages while still eschewing rule-based systems, then something different must be done with the resources we currently have. The approach was to design and implement a translation model which took advantage of the cross-linguistic correspondences in closely related languages at the sub-word level. This model was tested against a baseline for accuracy. The model was adjusted by varying word segment sizes, and by varying weightings. Ten language pairs were used. Language pairs translating from Swedish had a marginal improvement above the baseline (0.9% into Danish, 0.7% into Norwegian). All other language pairs saw no improvement above the baseline in any experiment. While the method has not worked for the majority of language pairs, the marginal improvement shown by pairs from Swedish means that this method can work for certain language pairs, and perhaps could work for all, after some improvement.