Results 1 - 5 of 5
Translating with non-contiguous phrases
- In EMNLP, 2005
Abstract - Cited by 23 (6 self)
This paper presents a phrase-based statistical machine translation method based on non-contiguous phrases, i.e. phrases with gaps. A method for producing such phrases from word-aligned corpora is proposed. A statistical translation model that deals with such phrases is also presented, as well as a training method based on the maximization of translation accuracy, as measured with the NIST evaluation metric. Translations are produced by means of a beam-search decoder. Experimental results are presented that demonstrate how the proposed method generalizes better from the training data.
KenLM: Faster and smaller language model queries
- In Proc. of the Sixth Workshop on Statistical Machine Translation, 2011
Abstract - Cited by 5 (1 self)
We present KenLM, a library that implements two data structures for efficient language model queries, reducing both time and memory costs. The PROBING data structure uses linear probing hash tables and is designed for speed. Compared with the widely used SRILM, our PROBING model is 2.4 times as fast while using 57% of the memory. The TRIE data structure is a trie with bit-level packing, sorted records, interpolation search, and optional quantization aimed at lower memory consumption. TRIE simultaneously uses less memory than the smallest lossless baseline and less CPU than the fastest baseline. Our code is open-source, thread-safe, and integrated into the Moses, cdec, and Joshua translation systems. This paper describes the several performance techniques used and presents benchmarks against alternative implementations.
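To make the linear probing idea mentioned in this abstract concrete, here is a minimal sketch of a fixed-capacity hash table that resolves collisions by stepping to the next bucket. It is an illustration only, not KenLM's implementation: the class name `ProbingTable`, its methods, and the log-probability values are hypothetical, and Python's built-in `hash` stands in for whatever hash function a real library would use.

```python
class ProbingTable:
    """Toy fixed-capacity hash table with linear probing (illustration only)."""

    def __init__(self, capacity):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self._capacity = capacity
        self._keys = [None] * capacity    # None marks an empty bucket
        self._values = [None] * capacity
        self._size = 0

    def _find(self, key):
        """Return the bucket holding key, or the empty bucket where it would go."""
        i = hash(key) % self._capacity
        while self._keys[i] is not None and self._keys[i] != key:
            i = (i + 1) % self._capacity  # step to the next bucket, wrapping around
        return i

    def insert(self, key, value):
        i = self._find(key)
        if self._keys[i] is None:
            # Keep one bucket empty so that every probe sequence terminates.
            if self._size + 1 >= self._capacity:
                raise OverflowError("table is full")
            self._keys[i] = key
            self._size += 1
        self._values[i] = value

    def lookup(self, key, default=None):
        i = self._find(key)
        return self._values[i] if self._keys[i] is not None else default


# Hypothetical n-gram keys with made-up log probabilities.
table = ProbingTable(capacity=8)
table.insert(("the", "cat"), -1.5)
table.insert(("cat", "sat"), -2.0)
print(table.lookup(("the", "cat")))
print(table.lookup(("a", "dog"), default=float("-inf")))
```

Because entries that collide sit in adjacent buckets, a lookup touches a short contiguous run of memory, which is the usual argument for linear probing when speed matters more than space.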
Machine Translation System Combination with Flexible Word Ordering
Abstract - Cited by 4 (3 self)
We describe a synthetic method for combining machine translations produced by different systems given the same input. One-best outputs are explicitly aligned to remove duplicate words. Hypotheses follow system outputs in sentence order, switching between systems mid-sentence to produce a combined output. Experiments with the WMT 2009 tuning data showed an improvement of 2 BLEU points and 1 METEOR point over the best Hungarian-English system. Constrained to data provided by the contest, our system was submitted to the WMT 2009 shared system combination task.
Acknowledgments
- 2008
Abstract
I thank the Almighty for providing me with this opportunity to serve Him and make a contribution through His infinite wisdom. I thank my parents for their perseverance and unconditional support, without which I could never have accomplished this endeavor. I would also like to thank other members of my family including my cousin Muneer who has been watching my back from day one. I want to extend my deep appreciation to Dr. Venu Govindaraju, the chair of my dissertation committee. He has been an advisor and a mentor. His persistent guidance, omnipresent motivation and overall support have been the foundation of this thesis. He introduced me to the area of handwriting recognition and encouraged me to address the open challenge of retrieval from handwritten documents. I want to show my gratitude to Dr. Peter Scott, member of my dissertation committee. His course Computer Vision and Image Processing indeed laid a solid foundation for this research. His guidance and advice have always been helpful. In addition, I had the opportunity to be his Teaching Assistant for three semesters and his passion for teaching was a great motivation.
Language Model Rest Costs and Space-Efficient Storage
Abstract
Approximate search algorithms, such as cube pruning in syntactic machine translation, rely on the language model to estimate probabilities of sentence fragments. We contribute two changes that trade between accuracy of these estimates and memory, holding sentence-level scores constant. Common practice uses lower-order entries in an N-gram model to score the first few words of a fragment; this violates assumptions made by common smoothing strategies, including Kneser-Ney. Instead, we use a unigram model to score the first word, a bigram for the second, etc. This improves search at the expense of memory. Conversely, we show how to save memory by collapsing probability and backoff into a single value without changing sentence-level scores, at the expense of less accurate estimates for sentence fragments. These changes can be stacked, achieving better estimates with unchanged memory usage. In order to interpret changes in search accuracy, we adjust the pop limit so that accuracy is unchanged and report the change in CPU time. In a German-English Moses system with target-side syntax, improved estimates yielded a 63% reduction in CPU time; for a Hiero-style version, the reduction is 21%. The compressed language model uses 26% less RAM while equivalent search quality takes 27% more CPU. Source code is released as part of KenLM.
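The fragment estimate described in this abstract (a unigram model for the first word, a bigram model for the second, and so on) can be sketched as follows. This is a toy under stated assumptions, not the KenLM implementation: `fragment_estimate`, `models`, and all log-probability values are hypothetical, each `models[n]` stands in for a separately estimated order-n model stored as a plain dictionary, and backoff for unseen n-grams is omitted.

```python
def fragment_estimate(words, models, order):
    """Estimate the log probability of a fragment whose left context is unknown.

    models[n] maps an n-gram tuple to a log probability under a separately
    estimated order-n model (hypothetical toy data, no backoff handling).
    """
    total = 0.0
    for i in range(len(words)):
        # The first word sees no context, the second sees one word, and so on,
        # until enough context is available to use the full-order model.
        n = min(i + 1, order)
        ngram = tuple(words[i + 1 - n : i + 1])
        total += models[n][ngram]
    return total


# Made-up log probabilities for a bigram setup (order = 2).
models = {
    1: {("a",): -1.0, ("b",): -2.0, ("c",): -3.0},
    2: {("a", "b"): -0.5, ("b", "c"): -0.25},
}
print(fragment_estimate(["a", "b", "c"], models, order=2))  # prints -1.75
```

Here the first word is scored by the unigram model (-1.0) and the remaining two by the bigram model (-0.5 and -0.25), so the estimate is their sum.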

