Results 1 - 10
of
12
A Bit of Progress in Language Modeling
, 2001
"... Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Hull, 1 ..."
Abstract
-
Cited by 70 (1 self)
- Add to MetaCart
Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Hull, 1992; Kernighan et al., 1990; Srihari and Baltus, 1992). The most commonly used language models are very simple (e.g. a Katz-smoothed trigram model). There are many improvements over this simple model however, including caching, clustering, higherorder n-grams, skipping models, and sentence-mixture models, all of which we will describe below. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself. In this...
Algorithms For Bigram And Trigram Word Clustering
- Speech Communication
, 1995
"... . This paper presents and analyzes improved algorithms for clustering bigram and trigram word equivalence classes, and their respective results: 1) We give a detailed time complexity analysis of bigram clustering algorithms. 2) We present an improved implementation of bigram clustering so that large ..."
Abstract
-
Cited by 46 (0 self)
- Add to MetaCart
. This paper presents and analyzes improved algorithms for clustering bigram and trigram word equivalence classes, and their respective results: 1) We give a detailed time complexity analysis of bigram clustering algorithms. 2) We present an improved implementation of bigram clustering so that large corpora (38 million words and more) can be clustered within a small number of days or even hours. 3) We extend the clustering approach from bigrams to trigrams. 4) We present experimental results on a 38 million word training corpus. 1. INTRODUCTION Word equivalence classes are a method for improving undertrained word M--gram language models [1], [2], [4]. Words are grouped into classes, and each word belongs to only one such class. Thus, if a word pair is not seen in training, it is quite likely that the corresponding class pair is seen. For bigram and trigram class models, we have the equations p(wn jwn\Gamma1 ) = p0(wn jG(wn)) (1) \Deltap 1(G(wn)jG(wn\Gamma1)) p(wn jwn\Gamma2 ; wn\Gam...
Modeling Out-Of-Vocabulary Words For Robust Speech Recognition
, 2000
"... This thesis concerns the problem of unknown or out-of-vocabulary (00V) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognize ..."
Abstract
-
Cited by 43 (5 self)
- Add to MetaCart
This thesis concerns the problem of unknown or out-of-vocabulary (00V) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognizer erroneously substitutes the OOV word with a similarly sounding word from its vocabulary. Furthermore, a recognition error due to an OOV word tends to spread errors into neighboring words; dramatically degrading overall recognition performance.
Towards Better Integration Of Semantic Predictors In Statistical Language Modeling
- In Proceedings of ICSLP-98
, 1998
"... We introduce a number of techniques designed to help integrate semantic knowledge with N-gram language models for automatic speech recognition. Our techniques allow us to integrate Latent Semantic Analysis (LSA), a word-similarity algorithm based on word co-occurrence information, with N-gram models ..."
Abstract
-
Cited by 37 (0 self)
- Add to MetaCart
We introduce a number of techniques designed to help integrate semantic knowledge with N-gram language models for automatic speech recognition. Our techniques allow us to integrate Latent Semantic Analysis (LSA), a word-similarity algorithm based on word co-occurrence information, with N-gram models. While LSA is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with N-grams. We show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with N-grams produces a more robust language model which has a lower perplexity on a Wall Street Journal testset than a baseline N-gram model. 1. INTRODUCTION There has been a lot of recent work on augmenting n-gram language models with other information sources such as longer distance syntactic, and semantic constraints (e.g. [8], [6]). In previous ...
The Use of Clustering Techniques for Language Modeling - Application to Asian Languages
"... Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling. It attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used to improve the performance (i.e. perplex ..."
Abstract
-
Cited by 15 (11 self)
- Add to MetaCart
Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling. It attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used to improve the performance (i.e. perplexity) of language models as well as to compress language models. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus, and on a Chinese heterogeneous corpus.
A Category Based Approach for Recognition of Out-of-Vocabulary Words
- In Int. Conf. on Spoken Language Processing
, 1996
"... In almost all applications of automatic speech recognition, especially in spontaneous speech tasks, the recognizer vocabulary cannot cover all occurring words. There is always a significant amount of out-of-vocabulary words even when the vocabulary size is very large. In this paper we present a new ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
In almost all applications of automatic speech recognition, especially in spontaneous speech tasks, the recognizer vocabulary cannot cover all occurring words. There is always a significant amount of out-of-vocabulary words even when the vocabulary size is very large. In this paper we present a new approach for the integration of out-of-vocabulary words into statistical language models. We use category information for all words in the training corpus to define a function that gives an approximation of the out-of-vocabulary word emission probability for each word category. This information is integrated into the language models. Although we use a simple acoustic model for out-of-vocabulary words, we achieve a 6% reduction of word error rate on spontaneous speech data with about 5% out-of-vocabulary rate.
Exploring Asymmetric Clustering for Statistical Language Modeling
- Proceedings of the Fortieth Annual Meeting of the Association for Computational Linguistics (ACL’2002). Philadelphia
, 2002
"... The n-gram model is a stochastic model, which predicts the next word (predicted word) given the previous words (conditional words) in a word sequence. ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
The n-gram model is a stochastic model, which predicts the next word (predicted word) given the previous words (conditional words) in a word sequence.
Document Space Models Using Latent Semantic Analysis
- In Eurospeech
, 1997
"... In this paper, an approach for constructing mixture language models (LMs) based on some notion of semantics is discussed. To this end, a technique known as latent semantic analysis (LSA) is used. The approach encapsulates corpus-derived semantic information and is able to model the varying style of ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
In this paper, an approach for constructing mixture language models (LMs) based on some notion of semantics is discussed. To this end, a technique known as latent semantic analysis (LSA) is used. The approach encapsulates corpus-derived semantic information and is able to model the varying style of the text. Using such information, the corpus texts are clustered in an unsupervised manner and mixture LMs are automatically created. This work builds on previous work in the field of information retrieval which was recently applied by Bellegarda et. al. to the problem of clustering words by semantic categories. The principal contribution of this work is to characterize the document space resulting from the LSA modeling and to demonstrate the approach for mixture LM application. Comparison is made between manual and automatic clustering in order to elucidate how the semantic information is expressed in the space. It is shown that, using semantic information, mixture LMs performs better than ...
Topic-Based Mixture Language Modelling
, 2000
"... This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. U ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling. A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (...
Automatic Acquisition of Phrase Grammars for Stochastic Language Modeling
- ACL WORKSHOP ON VERY LARGE CORPORA PROC
, 1998
"... Phrase-based language models have been recognized to have an advantage over word-based language models since they allow us to capture long spanning dependencies. Class based language models have been used to improve model generalization and overcome problems with data sparseness. In this paper, we p ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Phrase-based language models have been recognized to have an advantage over word-based language models since they allow us to capture long spanning dependencies. Class based language models have been used to improve model generalization and overcome problems with data sparseness. In this paper, we present a novel approach for combining the phrase acquisition with class construction process to automatically acquire phrase-grammar fragments from a given corpus. The phrase-grammar learning is decomposed into two sub-problems, namely the phrase acquisition and feature selection. The phrase acquisition is based on entropy minimization and the feature selection. is driven by the entropy reduction principle. We further demonstrate that the phrasegrammar based n-gram language model significantly outperforms a phrase-based n-gram language model in an end-to-end evaluation of a spoken language application.

