Towards Better Integration Of Semantic Predictors In Statistical Language Modeling
- In Proceedings of ICSLP-98, 1998
Cited by 37 (0 self)
We introduce a number of techniques designed to help integrate semantic knowledge with N-gram language models for automatic speech recognition. Our techniques allow us to integrate Latent Semantic Analysis (LSA), a word-similarity algorithm based on word co-occurrence information, with N-gram models. While LSA is good at predicting content words that are coherent with the rest of a text, it is a poor predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with N-grams. We show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with N-grams produces a more robust language model, one with lower perplexity on a Wall Street Journal test set than a baseline N-gram model.
1. INTRODUCTION There has been a lot of recent work on augmenting N-gram language models with other information sources, such as longer-distance syntactic and semantic constraints (e.g. [8], [6]). In previous ...
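The contrast between linear and geometric combination mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weight `lam`, and the toy probabilities are assumptions made for illustration, and the per-word confidence metric and dynamic-range adjustment are not modeled.

```python
def linear_mix(p_ngram, p_sem, lam):
    """Linear interpolation of two distributions over the same vocabulary."""
    return {w: lam * p_sem[w] + (1 - lam) * p_ngram[w] for w in p_ngram}


def geometric_mix(p_ngram, p_sem, lam):
    """Weighted geometric mean of two distributions, renormalized to sum to 1."""
    raw = {w: (p_sem[w] ** lam) * (p_ngram[w] ** (1 - lam)) for w in p_ngram}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}


# Toy, made-up distributions over a three-word vocabulary (illustrative only).
p_ngram = {"the": 0.6, "stock": 0.3, "banana": 0.1}
p_sem = {"the": 0.34, "stock": 0.5, "banana": 0.16}

linear = linear_mix(p_ngram, p_sem, lam=0.3)
geometric = geometric_mix(p_ngram, p_sem, lam=0.3)
```

Unlike the linear mix, the geometric mix needs an explicit renormalization step, because a weighted product of two distributions does not generally sum to one.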
A Comparative Evaluation of Data-driven Models in Translation
We present a comparative evaluation of two data-driven models used for translation selection in English-Korean machine translation: latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA). These models can represent the complex semantic structure of a given context, such as a text passage. Grammatical relationships stored in dictionaries are an essential input to translation selection. We use k-nearest neighbor (k-NN) learning to select an appropriate translation for instances not seen in the dictionary, with the distance between instances computed from the similarity measured by LSA and PLSA. For the experiments, we used TREC data (AP news from 1988) to construct the latent semantic spaces of the two models and the Wall Street Journal corpus to evaluate the translation accuracy of each model. PLSA selected relatively more accurate translations than LSA in the experiment, irrespective of the value of k and the type of grammatical relationship.
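The k-NN selection step described in this abstract can be sketched roughly as follows. The sketch assumes that each instance has already been mapped to a vector in some latent space and uses cosine similarity as the similarity measure. The vectors, the labels, and the names `cosine` and `knn_select` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_select(query_vec, labeled, k):
    """Return the majority label among the k instances most similar to the query.

    labeled: list of (vector, translation_label) pairs.
    """
    ranked = sorted(labeled, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical latent vectors with placeholder translation labels.
labeled = [
    ([1.0, 0.0], "translation_a"),
    ([0.9, 0.1], "translation_a"),
    ([0.0, 1.0], "translation_b"),
]
choice = knn_select([1.0, 0.05], labeled, k=3)
```

Swapping the similarity function is the only change needed to compare two latent-space models under the same k-NN procedure.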
Dynamic Topic Analysis: Classification Without Established Classes using Distance Thresholds
Document classification has proven useful for problems where the range of the target function is a fixed set of classes, but many problems require organizing documents when the possible classes are not established in advance. This paper describes an unsupervised text classification system that uses a variant of the k-nearest-neighbor (kNN) method to classify document instances and a Zipf filter to limit the dimension of the feature space. The classifier is iterative, relying on training data only for feature selection and delaying class assignment until the query phase. Document instances presented during the query phase are either added to an existing, derived class or assigned to a newly created class. Using a conditional entropy measure to evaluate how well the method partitions document instances, the paper shows that the performance of such a classifier can be improved by reducing the dimensionality of the feature space with Latent Semantic Indexing (LSI).
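The idea of assigning a document to an existing class or creating a new one can be illustrated with a simplified sketch. It assumes a single nearest neighbor, Euclidean distance, and a fixed distance threshold, which may differ from the kNN variant the paper actually uses. The function names and the toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def assign_classes(docs, threshold):
    """Assign a class id to each document vector, creating classes on demand.

    A document joins the class of its nearest previously seen document when
    that document lies within `threshold`; otherwise it starts a new class.
    """
    seen = []      # (vector, class_id) pairs for documents processed so far
    labels = []
    next_id = 0
    for vec in docs:
        nearest = min(seen, key=lambda item: euclidean(vec, item[0]), default=None)
        if nearest is not None and euclidean(vec, nearest[0]) <= threshold:
            class_id = nearest[1]
        else:
            class_id = next_id
            next_id += 1
        seen.append((vec, class_id))
        labels.append(class_id)
    return labels


# Toy document vectors forming two well-separated groups.
labels = assign_classes([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]], threshold=1.0)
```

The threshold controls granularity: a smaller value produces more, tighter classes, while a larger value merges documents into fewer classes.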
Cache-based Statistical Language Models of English and Highly Inflected Lithuanian
2005
This paper investigates a variety of statistical cache-based language models built on three corpora: English, Lithuanian, and Lithuanian base forms. It studies how the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic) affect the perplexity of a language model. The best results are achieved by models with three components, a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, joined by linear interpolation with dynamic weight updates. Such a model yielded perplexity improvements of up to 36% and 43% over the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best language model for English yielded a perplexity improvement of up to 16%. This suggests that cache-based modeling is of greater utility for free-word-order, highly inflected languages.
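A decaying cache unigram combined with a base model by linear interpolation can be sketched as follows. This is an illustrative simplification: it assumes an exponential decay function and a static interpolation weight, whereas the paper also considers other decay functions and dynamic weight updates. The names `decaying_cache_prob` and `interpolate` and all parameter values are hypothetical.

```python
def decaying_cache_prob(word, history, decay=0.9):
    """Cache unigram probability of `word` with exponentially decayed counts.

    The most recent token in `history` has weight 1, the one before it
    `decay`, then `decay ** 2`, and so on.
    """
    weights = [decay ** i for i in range(len(history))]
    total = sum(weights)
    if total == 0:
        return 0.0  # empty cache: no evidence for any word
    hit = sum(w for w, tok in zip(weights, reversed(history)) if tok == word)
    return hit / total


def interpolate(p_base, p_cache, lam=0.2):
    """Linear interpolation of a base-model probability with a cache probability."""
    return lam * p_cache + (1 - lam) * p_base


history = ["a", "b", "a"]
p_a = decaying_cache_prob("a", history, decay=0.5)
p_b = decaying_cache_prob("b", history, decay=0.5)
mixed = interpolate(p_base=0.1, p_cache=p_a, lam=0.2)
```

With `decay=1.0` the cache reduces to plain relative frequencies over the history, so the decay parameter is what makes recent words count more.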
A Comparative Evaluation of Data-driven Models in Translation Selection of Machine Translation
We present a comparative evaluation of two data-driven models used for translation selection in English-Korean machine translation: latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA). These models can represent the complex semantic structure of a given context, such as a text passage. Grammatical relationships stored in dictionaries are an essential input to translation selection. We use k-nearest neighbor (k-NN) learning to select an appropriate translation for instances not seen in the dictionary, with the distance between instances computed from the similarity measured by LSA and PLSA. For the experiments, we used TREC data (AP news from 1988) to construct the latent semantic spaces of the two models and the Wall Street Journal corpus to evaluate the translation accuracy of each model. PLSA selected relatively more accurate translations than LSA in the experiment, irrespective of the value of k and the type of grammatical relationship.

