## Training continuous space language models: some practical issues

Citations: 4 (0 self)

### BibTeX

@MISC{Son_trainingcontinuous,
  author = {Le Hai Son and Alexandre Allauzen and Guillaume Wisniewski and François Yvon},
  title = {Training continuous space language models: some practical issues},
  year = {2010}
}

### Abstract

Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging and does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms of perplexity and on a large-scale translation task.

### Citations

1670 | Bleu: a method for automatic evaluation of machine translation
- Papineni, Roukos, et al.
- 2001
Citation Context ...smoothing described in (Allauzen et al., 2009). The weights used during the reranking are tuned using the Minimum Error Rate Training algorithm (Och, 2003). Performance is measured based on the BLEU (Papineni et al., 2002) scores, which are reported in Table 4. Table 4: BLEU scores on the NIST MT08 test set with different language models. Vc size Model # epochs BLEU all baseline - 37.8 10000 log bilinear 6 38.2 standa... |
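The BLEU metric cited in this context combines clipped (modified) n-gram precisions with a brevity penalty. A minimal single-reference, sentence-level sketch of the idea (real evaluations such as the one reported here use corpus-level statistics and often multiple references):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Illustrative sentence-level BLEU with a single reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Modified precision: clip candidate n-gram counts by reference counts.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor to avoid log(0)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; any mismatch or length shortfall pulls the score below that.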

1014 | Moses: Open source toolkit for statistical machine translation
- Koehn, Hoang, et al.
- 2007
Citation Context ...eviously described in section 3.1). The development data is again the 2006 NIST test set and the test data is the official 2008 NIST test set. Our system is built using the open-source Moses toolkit (Koehn et al., 2007) with default settings. To set up our baseline results, we used an extensively optimized standard back-off 4-gram language model using Kneser-Ney smoothing described in (Allauzen et al., 2009). The ... |

940 | An empirical study of smoothing techniques for language modeling. Computer Speech and Language
- Chen, Goodman
- 1999
Citation Context ...ends to underestimate the probability of very rare n-grams, which are hardly observed even in huge corpora. Conventional smoothing techniques, such as Kneser-Ney and Witten-Bell back-off schemes (see (Chen and Goodman, 1996) for an empirical overview, and (Teh, 2006) for a Bayesian interpretation), perform back-off on lower order distributions to provide an estimate for the probability of these unseen events. n-gram lan... |
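The back-off idea described in this excerpt can be illustrated with interpolated Kneser-Ney at the bigram level: a fixed discount is subtracted from every observed bigram count and the freed mass is redistributed over a continuation distribution (how many distinct histories each word follows). A minimal sketch, with the `discount` value and all function names purely illustrative:

```python
from collections import Counter

def kn_bigram(corpus_tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram estimator (minimal sketch)."""
    big_c = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    uni_c = Counter(corpus_tokens[:-1])           # history counts
    followers = Counter(v for v, _ in big_c)      # distinct continuations per history
    preceders = Counter(w for _, w in big_c)      # distinct histories per word
    total_types = len(big_c)                      # number of distinct bigram types

    def prob(w, v):
        cont = preceders[w] / total_types         # continuation probability of w
        if uni_c[v] == 0:
            return cont                           # unseen history: back off fully
        lam = discount * followers[v] / uni_c[v]  # mass reserved for backing off
        return max(big_c[(v, w)] - discount, 0) / uni_c[v] + lam * cont

    return prob
```

By construction the discounted mass exactly matches the back-off weight, so the estimates sum to one over the vocabulary for any seen history.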

738 | Class-Based N-Gram Models of Natural Language
- Brown, Della Pietra, et al.
- 1992
Citation Context ...the lexicon are completely ignored, which negatively impacts the generalization performance of the model. Various approaches have been proposed to overcome this limitation, notably the use of word-classes (Brown et al., 1992; Niesler, 1997), of generalized back-off strategies (Bilmes et al., 1997) or the explicit integration of morphological information in the random-forest model (Xu and Jelinek, 2004; Oparin et al., 200... |
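The word-class approach of Brown et al. cited here factors the bigram probability as p(w | v) ≈ p(c(w) | c(v)) · p(w | c(w)), so statistics are shared among words of the same class. A small sketch assuming a fixed, externally supplied `word2class` mapping (in the original work the classes are themselves induced from data):

```python
from collections import Counter

def class_bigram(tokens, word2class):
    """Class-based bigram p(w|v) = p(c(w)|c(v)) * p(w|c(w)) -- a sketch."""
    classes = [word2class[w] for w in tokens]
    cc = Counter(zip(classes, classes[1:]))   # class-bigram counts
    c_hist = Counter(classes[:-1])            # class history counts
    w_cnt = Counter(tokens)                   # word unigram counts
    c_tot = Counter(classes)                  # class unigram counts

    def prob(w, v):
        cw, cv = word2class[w], word2class[v]
        p_class = cc[(cv, cw)] / c_hist[cv] if c_hist[cv] else 0.0
        p_word = w_cnt[w] / c_tot[cw]         # emission of w within its class
        return p_class * p_word

    return prob
```

Because both factors are normalized from counts, the product still sums to one over the vocabulary for any seen history, while unseen word bigrams inside a seen class pair get nonzero mass.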

384 | Reducing the dimensionality of data with neural networks
- Hinton, Salakhutdinov
- 2006
Citation Context ...tialization and the training scheme for neural network language models. Both our experimental results and our new training methods can be closely related to the pre-training techniques introduced by (Hinton and Salakhutdinov, 2006). Our future work will thus aim at studying the connections between our empirical observations and the deep learning framework. Acknowledgments This work was partly realized as part of the Quaero Pro... |

257 | A Maximum Entropy Approach to Adaptive Statistical Language Modeling
- Rosenfeld
- 1996
Citation Context ...n drastically reduces the number of free parameters of the LBL model. Finally, it is worth noting the similarity of this model with standard maximum entropy language models (Lau et al., 1993; Rosenfeld, 1996). Let x denote the binary vector formed by stacking the (n-1) 1-of-V encodings of the history words; then the conditional probability distributions estimated in the model are proportional to exp F (x... |
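The LBL model discussed in this excerpt predicts a target-word representation as a linear function of the history embeddings and scores every vocabulary word by its similarity to that prediction, with a softmax giving the exp-linear form the excerpt mentions. A minimal numpy forward-pass sketch (shapes and names are illustrative, not the paper's exact parametrization):

```python
import numpy as np

def lbl_probs(history_ids, C, R, b):
    """Log-bilinear LM forward pass (minimal sketch).

    C: (n-1, d, d) position-specific combination matrices,
    R: (V, d) word embeddings, b: (V,) word biases -- all hypothetical shapes.
    """
    # Predicted target representation: linear combination of history embeddings.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(history_ids))
    scores = R @ r_hat + b        # dot-product similarity to every word
    scores -= scores.max()        # numerical stabilization before the softmax
    p = np.exp(scores)
    return p / p.sum()
```

Because the history enters only linearly, flattening the 1-of-V history encodings into x indeed makes the scores a linear function of x, which is the maximum-entropy connection the excerpt draws.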

166 | A neural probabilistic language model
- Bengio, Ducharme, et al.
- 2003
Citation Context ...gration of morphological information in the random-forest model (Xu and Jelinek, 2004; Oparin et al., 2008). One of the most successful alternatives to date is to use distributed word representations (Bengio et al., 2003), where distributionally similar words are represented as neighbors in a continuous space. This 778 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 778–7... |
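The Bengio et al. architecture referenced here maps each history word to a shared embedding, feeds the concatenated embeddings through a nonlinear hidden layer, and normalizes the output scores with a softmax. A minimal forward-pass sketch in which all parameter names and shapes are assumptions (the original model also has direct embedding-to-output connections, omitted here):

```python
import numpy as np

def nplm_probs(history_ids, E, W, U, bh, bo):
    """Feed-forward neural probabilistic LM forward pass (sketch).

    E: (V, d) shared embedding table; W: (h, (n-1)*d) hidden weights;
    U: (V, h) output weights; bh: (h,), bo: (V,) biases -- illustrative shapes.
    """
    x = np.concatenate([E[w] for w in history_ids])  # concatenated history embeddings
    hidden = np.tanh(W @ x + bh)                     # nonlinear hidden layer
    scores = U @ hidden + bo
    scores -= scores.max()                           # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()
```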

123 | Statistical Machine Translation
- Koehn
Citation Context ...iments As a last experiment, we compare the various models on a large scale machine translation task. Statistical language models are a key component of current statistical machine translation systems (Koehn, 2010), where they both help disambiguate lexical choices in the target language and influence the choice of the right word ordering. The integration of a neural network language model in such a system is ... |

89 | A hierarchical Bayesian language model based on Pitman-Yor processes - Teh |

54 | A scalable hierarchical distributed language model
- Mnih, Hinton
- 2008
Citation Context ...ing increasingly used. These successes have revitalized the research on neural architectures for language models, and given rise to several new proposals (see, for instance, (Mnih and Hinton, 2007; Mnih and Hinton, 2008; Collobert and Weston, 2008)). A major difficulty with these approaches remains the complexity of training, which does not scale well to the massive corpora that are nowadays available. Practical sol... |
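The scalability fix behind Mnih and Hinton's 2008 hierarchical model is to replace the flat softmax over V words with a sequence of binary decisions along a word's path in a tree, so computing one probability costs O(log V) sigmoids instead of O(V) exponentials. A toy sketch in which `word_code` (the left/right bits of the path) and `node_scores` (the inner-node logits) are hypothetical inputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hierarchical_prob(word_code, node_scores):
    """P(word) as a product of binary decisions down the word's tree path.

    word_code: list of bits choosing a branch at each inner node;
    node_scores: matching list of logits at those nodes (both hypothetical).
    """
    p = 1.0
    for bit, s in zip(word_code, node_scores):
        # Each inner node splits its probability mass between its two children.
        p *= sigmoid(s) if bit == 1 else 1.0 - sigmoid(s)
    return p
```

Since every node splits its mass exactly in two, the probabilities of all leaf words sum to one for any setting of the node logits, with no explicit normalization over the vocabulary.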

47 | Three new graphical models for statistical language modelling
- Mnih, Hinton
- 2007
Citation Context ...nguage models are becoming increasingly used. These successes have revitalized the research on neural architectures for language models, and given rise to several new proposals (see, for instance, (Mnih and Hinton, 2007; Mnih and Hinton, 2008; Collobert and Weston, 2008)). A major difficulty with these approaches remains the complexity of training, which does not scale well to the massive corpora that are nowadays a... |

37 | Continuous space language models
- Schwenk
- 2007
Citation Context ...probability estimates are jointly computed in a multi-layer neural network architecture. This approach has shown significant and consistent improvements when applied to automatic speech recognition (Schwenk, 2007; Emami and Mangu, 2007; Kuo et al., 2010) and machine translation tasks (Schwenk et al., 2006). Hence, continuous space language models are becoming increasingly used. These successes have revitalize... |

14 | Category-based statistical language models
- Niesler
- 1997
Citation Context ...letely ignored, which negatively impacts the generalization performance of the model. Various approaches have been proposed to overcome this limitation, notably the use of word-classes (Brown et al., 1992; Niesler, 1997), of generalized back-off strategies (Bilmes et al., 1997) or the explicit integration of morphological information in the random-forest model (Xu and Jelinek, 2004; Oparin et al., 2008). One of the ... |

9 | Syntactic features for Arabic speech recognition
- Kuo, Mangu, et al.
- 2009
Citation Context ...uted in a multi-layer neural network architecture. This approach has shown significant and consistent improvements when applied to automatic speech recognition (Schwenk, 2007; Emami and Mangu, 2007; Kuo et al., 2010) and machine translation tasks (Schwenk et al., 2006). Hence, continuous space language models are becoming increasingly used. These successes have revitalized the research on neural architectures ... |

6 | Using PHIPAC to speed error back-propagation learning
- Bilmes, Asanovic, et al.
- 1997
Citation Context ...zation performance of the model. Various approaches have been proposed to overcome this limitation, notably the use of word-classes (Brown et al., 1992; Niesler, 1997), of generalized back-off strategies (Bilmes et al., 1997) or the explicit integration of morphological information in the random-forest model (Xu and Jelinek, 2004; Oparin et al., 2008). One of the most successful alternatives to date is to use distributed ... |

3 | LIMSI’s statistical translation system for WMT’09
- Allauzen, Crego, et al.
- 2009
Citation Context ... toolkit (Koehn et al., 2007) with default settings. To set up our baseline results, we used an extensively optimized standard back-off 4-gram language model using Kneser-Ney smoothing described in (Allauzen et al., 2009). The weights used during the reranking are tuned using the Minimum Error Rate Training algorithm (Och, 2003). Performance is measured based on the BLEU (Papineni et al., 2002) scores, which are repo... |

2 | Empirical study of neural network language models for Arabic speech recognition
- Emami, Mangu
- 2007
Citation Context ...imates are jointly computed in a multi-layer neural network architecture. This approach has shown significant and consistent improvements when applied to automatic speech recognition (Schwenk, 2007; Emami and Mangu, 2007; Kuo et al., 2010) and machine translation tasks (Schwenk et al., 2006). Hence, continuous space language models are becoming increasingly used. These successes have revitalized the research on neuro... |

1 | Morphological random forests for language modeling of inflectional languages - Oparin, Glembek, et al. - 2008 |