Rapid language model development for new task domains
- Proc. First International Conference on Language Resources and Evaluation (LREC), 1998
Cited by 16 (6 self)
Data sparseness has been regularly indicted as the primary problem in statistical language modelling. We go one step further to consider the situation when no text data is available for the target domain. We present two techniques for building efficient language models quickly for new domains. The first technique is based on using a context-free grammar to generate a corpus of word collocations. The second is an adaptation technique based on using out-of-domain corpora to estimate target domain language models. We report results of successfully using these two techniques individually and in combination to build efficient models for a spontaneous speech recognition task in a medium-sized vocabulary domain.
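As a rough illustration of the first technique, the sketch below expands a small context-free grammar into every sentence it can derive and collects bigram counts from the result. The grammar, the symbol names, and the helpers `expand` and `bigram_counts` are hypothetical and chosen only for illustration; the abstract does not describe the actual grammar or how the authors built their collocation corpus.

```python
import itertools
from collections import Counter

# Hypothetical toy grammar (not from the paper): each nonterminal maps to
# a list of productions, and each production is a list of symbols.
GRAMMAR = {
    "S": [["GREETING"], ["REQUEST"]],
    "GREETING": [["hello"], ["good", "morning"]],
    "REQUEST": [["book", "a", "ROOM"]],
    "ROOM": [["room"], ["meeting", "room"]],
}

def expand(symbol):
    """Yield every word sequence derivable from symbol.

    Assumes the grammar is non-recursive, so the enumeration terminates.
    """
    if symbol not in GRAMMAR:
        # Terminal symbol: it stands for itself.
        yield [symbol]
        return
    for production in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        parts = [list(expand(s)) for s in production]
        for combo in itertools.product(*parts):
            yield [word for part in combo for word in part]

def bigram_counts(sentences):
    """Count adjacent word pairs, with sentence boundary markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

corpus = list(expand("S"))
counts = bigram_counts(corpus)
```

Counts gathered this way could then seed an n-gram model for a domain with no text data, which is the situation the abstract targets. A recursive grammar would need a depth limit or random sampling instead of exhaustive enumeration.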
Domain adaptation with clustered language models
- In Proceedings of International Conference on Acoustics, Speech and Signal Processing, 1997
Cited by 6 (1 self)
In this paper, a method of domain adaptation for clustered language models is developed. It is based on a previously developed clustering algorithm, but with a modified optimisation criterion. The results are shown to be slightly superior to the previously published 'Fillup' method, which can be used to adapt standard n-gram models. However, the improvement both methods give compared to models built from scratch on the adaptation data is quite small (less than 11% relative improvement in word error rate). This suggests that both methods are still unsatisfactory from a practical point of view.
Analyzing And Improving Statistical Language Models For Speech Recognition
- 1994
Cited by 3 (0 self)
A speech recognizer is a device that translates speech into text. Many current speech recognizers contain two components, an acoustic model and a statistical language model. The acoustic model indicates how likely it is that a certain word corresponds to a part of the acoustic signal (e.g. the speech). The statistical language model indicates how likely it is that a certain word will be spoken next, given the words recognized so far. Even though the acoustic model might not, for example, be able to decide between the acoustically similar words "peach" and "teach", the statistical language model can indicate that the word "peach" is more likely if the previously recognized words are "He ate the". Current speech recognizers perform well on constrained tasks, but the goal of continuous, speaker-independent speech recognition in potentially noisy environments with a very large vocabulary has not been reached so far. How can statistical language models be improved so that more complex tasks c...
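The "peach" versus "teach" example can be made concrete with a minimal sketch that combines the two scores in the log domain. All probabilities below are invented for illustration, and the function name `best_word` is hypothetical; nothing here is taken from the thesis itself.

```python
import math

# Hypothetical scores: the acoustic model barely separates the two words,
# while the language model strongly prefers "peach" after "He ate the".
acoustic_prob = {"peach": 0.48, "teach": 0.52}
lm_prob_after_he_ate_the = {"peach": 0.02, "teach": 0.0001}

def best_word(candidates, acoustic, language):
    """Pick the candidate maximizing log P(signal | word) + log P(word | history)."""
    return max(
        candidates,
        key=lambda w: math.log(acoustic[w]) + math.log(language[w]),
    )

candidates = ["peach", "teach"]
acoustic_only = max(candidates, key=lambda w: acoustic_prob[w])
combined = best_word(candidates, acoustic_prob, lm_prob_after_he_ate_the)
```

With these made-up numbers the acoustic model alone would choose "teach", while the combined score chooses "peach", matching the behaviour the abstract describes.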
A Comparative Study of Topic Identification on Newspaper and E-mail
- In Proceedings of the String Processing and Information Retrieval Conference (SPIRE2001), 2001
This paper presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. This study is very fruitful for our research. Statistical topic identification methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identification on a general newspaper corpus but does not exceed on e-mail corpus. Another method gives the best result on e-mails, but does not behave the same way on a newspaper corpus. We also show in this paper that almost all our methods achieve good results in retrieving the first two manually annotated labels.
TASK ADAPTATION USING MAP ESTIMATION IN N-GRAM LANGUAGE MODELING
This paper describes a method of task adaptation in N-gram language modeling, for accurately estimating the N-gram statistics from the small amount of data of the target task. Assuming a task-independent N-gram to be a priori knowledge, the N-gram is adapted to a target task by MAP (maximum a posteriori probability) estimation. Experimental results showed that the perplexities of the task-adapted models were 15% (trigram) and 24% (bigram) lower than those of the task-independent model, and that the perplexity reduction from adaptation reached up to 39% when the amount of text data in the adapted task was very small.
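One common way to realize this kind of adaptation is to treat the task-independent probabilities as pseudo-counts that are added to the task counts. The sketch below uses that count-merging form for a single history; the abstract does not give the paper's exact estimator, so the formula, the prior weight `tau`, the function name `map_adapt`, and the toy numbers are all assumptions.

```python
def map_adapt(task_counts, prior_probs, tau=4.0):
    """Blend task counts with a task-independent distribution for one history.

    Assumed form: P(w | h) = (c_task(h, w) + tau * P_prior(w | h)) / (c_task(h) + tau).
    task_counts: {word: count} observed in the target task after history h.
    prior_probs: {word: probability} from the task-independent model; must sum
                 to 1 and cover every word that appears in task_counts.
    tau:         hypothetical prior weight, acting as a pseudo-count total.
    """
    total = sum(task_counts.values())
    return {
        word: (task_counts.get(word, 0) + tau * prob) / (total + tau)
        for word, prob in prior_probs.items()
    }

# Toy numbers, invented for illustration.
prior = {"flight": 0.2, "hotel": 0.3, "the": 0.5}
task = {"flight": 3, "hotel": 1}
adapted = map_adapt(task, prior, tau=4.0)
```

When the task data is small, the prior dominates the estimate; as task counts grow, the estimate moves toward the task's relative frequencies. That is in line with the abstract's focus on estimating N-gram statistics from a small amount of target-task data.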

