Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner
- PROC. OF MT SUMMIT XI, 2007
Cited by 10 (2 self)
In this paper, we apply a method of unsupervised morphology learning to a state-of-the-art phrase-based statistical machine translation (SMT) system. In SMT, words are traditionally used as the smallest units of translation. Such a system generalizes poorly to word forms that do not occur in the training data. This is particularly problematic for languages that are highly compounding, highly inflecting, or both. An alternative is to use sub-word units, such as morphemes. We use the Morfessor algorithm to find statistical morpheme-like units (called morphs) that can be used to reduce the size of the lexicon and improve the ability to generalize. Translation and language models are trained directly on morphs instead of words. The approach is tested on three Nordic languages (Danish, Finnish, and Swedish) that are included in the Europarl corpus, which consists of the Proceedings of the European Parliament. In our experiments, we did not obtain higher BLEU scores for the morph model than for the standard word-based approach. Nonetheless, the proposed morph-based solution has clear benefits: morphologically well-motivated structures (phrases) are learned, and the proportion of words left untranslated is clearly reduced.
QUERY LANGUAGE MODELING FOR VOICE SEARCH
Cited by 2 (2 self)
The paper presents an empirical exploration of google.com query stream language modeling. We describe the normalization of the typed query stream, resulting in out-of-vocabulary (OoV) rates below 1% for a one-million-word vocabulary. We present a comprehensive set of experiments that guided the design decisions for a voice search service. In the process we rediscovered a lesser-known interaction between Kneser-Ney smoothing and entropy pruning, and found empirical evidence that hints at non-stationarity of the query stream, as well as strong dependence on various English locales: USA, Britain and Australia.
Index Terms: language modeling, voice search, query stream
Pruning Exponential Language Models
Language model pruning is an essential technology for speech applications running on resource-constrained devices, and many pruning algorithms have been developed for conventional word n-gram models. However, while exponential language models can give superior performance, there has been little work on the pruning of these models. In this paper, we propose several pruning algorithms for general exponential language models. We show that our best algorithm applied to an exponential n-gram model outperforms existing n-gram model pruning algorithms by up to 0.4% absolute in speech recognition word-error rate on Wall Street Journal and Broadcast News data sets. In addition, we show that Model M, an exponential class-based language model, retains its performance improvement over conventional word n-gram models when pruned to equal size, with gains of up to 2.5% absolute in word-error rate.

