Language Modeling with Limited Domain Data
Proceedings of the 1995 ARPA Workshop on Spoken Language Technology, 1995
Abstract
Generic recognition systems contain language models which are representative of a broad corpus. In actual practice, however, recognition is usually on a coherent text covering a single topic, suggesting that knowledge of the topic at hand can be used to advantage. A base model can be augmented with information from a small sample of domain-specific language data to significantly improve recognition performance. Good performance may be obtained by merging in only those n-grams that include words that are out of vocabulary with respect to the base model.

1. Introduction

Current language modeling practice requires access to a substantial amount of text from a target domain in order to create a reliable language model. For the North American Business (CSR NAB) domain, 227M words were available. Of necessity, models based on large corpora cover a diversity of material and are fairly general in nature. In practice, a given sequence of input utterances (say a dictation) will stick to a parti...
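
The merging strategy mentioned in the abstract (adding only those domain n-grams that contain a word outside the base model's vocabulary) can be illustrated with a minimal sketch. This is not the paper's implementation: it works on raw counts rather than smoothed back-off probabilities, and the function names, the `max_order` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def extract_ngrams(tokens, max_order=2):
    """Count all n-grams of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def merge_oov_ngrams(base_counts, base_vocab, domain_tokens, max_order=2):
    """Return a copy of base_counts augmented with domain n-grams
    that contain at least one word missing from base_vocab.

    Hypothetical helper: a count-level stand-in for model merging.
    """
    merged = Counter(base_counts)
    domain_counts = extract_ngrams(domain_tokens, max_order)
    for ngram, count in domain_counts.items():
        # Keep the n-gram only if it introduces an out-of-vocabulary word.
        if any(word not in base_vocab for word in ngram):
            merged[ngram] += count
    return merged


# Toy data (invented for this example).
base_vocab = {"the", "market", "rose"}
base_counts = Counter({
    ("the",): 5,
    ("market",): 3,
    ("rose",): 2,
    ("the", "market"): 3,
})
domain_tokens = ["the", "angioplasty", "market", "rose"]

merged = merge_oov_ngrams(base_counts, base_vocab, domain_tokens)
```

In this toy run, n-grams involving the out-of-vocabulary word "angioplasty" are added, while in-vocabulary domain n-grams such as ("market", "rose") leave the base counts untouched. A real system would also need to renormalize probabilities after merging.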

