Two decades of statistical language modeling: Where do we go from here?
Proceedings of the IEEE, 2000
Cited by 147 (1 self)
Statistical Language Models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data.
1. OUTLINE
Statistical language modeling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications. By and large, statistical language modeling amounts to estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents. Statistical language modeling is crucial for a large variety of language technology applications. These include speech recognition (where SLM got its start), machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, and many more. In machine translation, for example, purely statistical approaches have been introduced in [1]. But even researchers using rule-based approaches have found it beneficial to introduce some elements of SLM and statistical estimation [2]. In information retrieval, a language modeling approach was recently proposed by [3], and a statistical/information-theoretical approach was developed by [4]. SLM employs statistical estimation techniques using language training data, that is, text. Because of the categorical nature of language, and the large vocabularies people naturally use, statistical techniques must estimate a large number of parameters, and consequently depend critically on the availability of large amounts of training data.
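To illustrate the closing point about parameter counts and training data, here is a minimal sketch that is not taken from the paper. It estimates bigram probabilities by relative frequency from a toy corpus and counts how many possible word pairs were never observed. The function name `bigram_mle` and the toy sentence are assumptions made for illustration only.

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w2 | w1) by relative frequency from a list of tokens."""
    context_counts = Counter(tokens[:-1])          # how often each word starts a bigram
    pair_counts = Counter(zip(tokens, tokens[1:]))  # how often each bigram occurs
    return {(w1, w2): c / context_counts[w1] for (w1, w2), c in pair_counts.items()}

# Toy corpus (illustrative only).
tokens = "the cat sat on the mat".split()
probs = bigram_mle(tokens)

vocab = set(tokens)
possible_pairs = len(vocab) ** 2            # one parameter per (w1, w2) pair
unseen_pairs = possible_pairs - len(probs)  # pairs that received no estimate at all
```

Even with a five-word vocabulary, most of the possible bigrams go unobserved, so a pure relative-frequency estimate assigns them no probability. With realistic vocabularies the number of possible pairs grows quadratically, which is one way to see why large training corpora matter.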
Using statistics in lexical analysis
Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, 1991
Cited by 144 (3 self)
The computational tools available for studying machine-readable corpora are at present still rather primitive. In the more advanced lexicographic organizations, there are concordancing programs (see figure below), which are basically KWIC (key word in context; Aho et al., 1988, p. 122; Salton, 1989, p. 384) indexes with additional features such as the ability to extend the context, sort leftwards as well as
Using Corpus Statistics and WordNet Relations for Sense Identification
1998
Cited by 143 (0 self)
Introduction
An impressive array of statistical methods has been developed for word sense identification. They range from dictionary-based approaches that rely on definitions (Véronis and Ide 1990; Wilks et al. 1993) to corpus-based approaches that use only word co-occurrence frequencies extracted from large textual corpora (Schütze 1995; Dagan and Itai 1994). We have drawn on these two traditions, using corpus-based co-occurrence and the lexical knowledge base that is embodied in the WordNet lexicon. The two traditions complement each other. Corpus-based approaches have the advantage of being generally applicable to new texts, domains, and corpora without needing costly and perhaps error-prone parsing or semantic analysis. They require only training corpora in which the sense distinctions have been marked, but therein lies their weakness. Obtaining training materials for statistical methods is costly and time-consuming; it is a "knowledge acquisition bottleneck" (Gale, Church, and Y
Supertagging: An Approach to Almost Parsing
Computational Linguistics, 1999
Cited by 134 (22 self)
In this paper, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. Our thesis is that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. The supertags are designed such that only those elements on which the lexical item imposes constraints appear within a given supertag. Further, each lexical item is associated with as many supertags as the number of different syntactic contexts in which the lexical item can appear. This makes the number of different descriptions for each lexical item much larger than when the descriptions are less complex, thus increasing the local ambiguity for a parser. But this local ambiguity can be resolved by using statistical distributions of supertag co-occurrences collected from a corpus of parses. We have explored these ideas in the context of the Lexicalized Tree-Adjoining Grammar (LTAG) framework. The supertags in LTAG combine both phrase structure information and dependency information in a single representation. Supertag disambiguation results in a representation that is effectively a parse (an almost parse), and the parser need 'only' combine the individual supertags. This method of parsing can also be used to parse sentence fragments such as in spoken utterances where the disambiguated supertag sequence may not combine into a single structure.
1 Introduction
In this paper, we present a robust parsing approach called supertagging that integrates the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. The idea underlying the approach is that the ...
Entropy-based pruning of backoff language models
In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop
Cited by 119 (7 self)
A criterion for pruning parameters from N-gram backoff language models is developed, based on the relative entropy between the original and the pruned model. It is shown that the relative entropy resulting from pruning a single N-gram can be computed exactly and efficiently for backoff models. The relative entropy measure can be expressed as a relative change in training set perplexity. This leads to a simple pruning criterion whereby all N-grams that change perplexity by less than a threshold are removed from the model. Experiments show that a production-quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance.
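The pruning criterion described in this abstract can be sketched on a toy bigram backoff model. This is a brute-force illustration under stated assumptions, not the paper's efficient closed-form computation. It recomputes the full conditional distribution before and after dropping one bigram, weights the resulting relative entropy by the unigram probability of the history (an assumption about how histories are weighted), and converts it to a relative perplexity change as exp(D) - 1. Each bigram is scored independently against the unpruned model. All names (`backoff_weight`, `cond_prob`, `pruning_cost`, `prune`) and the toy probabilities are hypothetical.

```python
import math

def backoff_weight(explicit, unigram):
    """alpha(h): mass left over by explicit bigrams, renormalised over the other words."""
    num = 1.0 - sum(explicit.values())
    den = 1.0 - sum(unigram[w] for w in explicit)
    return num / den

def cond_prob(w, explicit, unigram):
    """P(w | h): explicit bigram estimate if stored, otherwise backed-off unigram."""
    if w in explicit:
        return explicit[w]
    return backoff_weight(explicit, unigram) * unigram[w]

def pruning_cost(h, w, bigrams, unigram):
    """Relative entropy D(p || p') from dropping bigram (h, w), weighted by P(h)."""
    before = bigrams[h]
    after = {v: p for v, p in before.items() if v != w}
    divergence = 0.0
    for v in unigram:
        p = cond_prob(v, before, unigram)
        q = cond_prob(v, after, unigram)
        divergence += p * math.log(p / q)
    return unigram[h] * divergence

def prune(bigrams, unigram, threshold):
    """Keep only bigrams whose removal would raise perplexity by at least `threshold`."""
    kept = {}
    for h, explicit in bigrams.items():
        kept[h] = {
            w: p for w, p in explicit.items()
            if math.exp(pruning_cost(h, w, bigrams, unigram)) - 1.0 >= threshold
        }
    return kept

# Toy model (illustrative numbers; every history leaves some mass for backoff).
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}
bigrams = {
    "a": {"b": 0.6, "c": 0.21},
    "b": {"a": 0.6, "b": 0.24},  # P(b | b) equals what backoff would give anyway
}
pruned = prune(bigrams, unigram, threshold=0.001)
```

In this toy model the bigram ("b", "b") carries no information beyond the backed-off estimate, so removing it leaves the model unchanged and it is pruned, while the other bigrams stay. The sketch assumes every stored probability is positive and that no history assigns all of its mass to explicit bigrams.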
Introduction to the Special Issue on Computational Linguistics using Large Corpora
Computational Linguistics, 1993
Unsupervised Learning from Dyadic Data
1998
Cited by 100 (9 self)
Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This includes event co-occurrences, histogram data, and single stimulus preference data as special cases. Dyadic data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework for unsupervised learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures and unifies probabilistic modeling and structure discovery. Mixture models provide both a parsimonious yet flexible parameterization of probability distributions with good generalization performance on sparse data, and structural information about data-inherent grouping structure. We propose an annealed version of the standard Expectation Maximization algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.
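As a rough illustration of a flat latent class mixture model for dyadic data, the sketch below fits P(x, y) = sum_k P(k) P(x|k) P(y|k) to a small co-occurrence count matrix with EM. It is a sketch under assumptions, not the paper's algorithm. In particular, the exponent `beta` applied to the posterior is a simplified stand-in for annealing with a fixed value, whereas an annealing schedule would vary it across iterations. The function name `fit_aspect_model`, the initialisation, and the toy counts are hypothetical.

```python
import random

def fit_aspect_model(counts, n_classes, n_iters=50, beta=1.0, seed=0):
    """Fit P(x, y) = sum_k P(k) P(x|k) P(y|k) to a count matrix with (tempered) EM."""
    rng = random.Random(seed)
    n_x, n_y = len(counts), len(counts[0])

    def random_dist(n):
        raw = [rng.random() + 0.1 for _ in range(n)]
        total = sum(raw)
        return [r / total for r in raw]

    p_k = random_dist(n_classes)
    p_x = [random_dist(n_x) for _ in range(n_classes)]  # p_x[k][x] = P(x | k)
    p_y = [random_dist(n_y) for _ in range(n_classes)]  # p_y[k][y] = P(y | k)

    for _ in range(n_iters):
        new_k = [0.0] * n_classes
        new_x = [[0.0] * n_x for _ in range(n_classes)]
        new_y = [[0.0] * n_y for _ in range(n_classes)]
        for x in range(n_x):
            for y in range(n_y):
                if counts[x][y] == 0:
                    continue
                # E-step: class posterior for this dyad; beta = 1.0 is plain EM.
                post = [(p_k[k] * p_x[k][x] * p_y[k][y]) ** beta
                        for k in range(n_classes)]
                z = sum(post)
                for k in range(n_classes):
                    r = counts[x][y] * post[k] / z
                    new_k[k] += r
                    new_x[k][x] += r
                    new_y[k][y] += r
        # M-step: renormalise expected counts (guard against an emptied class).
        total = sum(new_k)
        p_k = [nk / total for nk in new_k]
        p_x = [[v / max(new_k[k], 1e-300) for v in new_x[k]] for k in range(n_classes)]
        p_y = [[v / max(new_k[k], 1e-300) for v in new_y[k]] for k in range(n_classes)]
    return p_k, p_x, p_y

# Toy co-occurrence counts with two visible groups (illustrative only).
counts = [[4, 3, 0, 0],
          [5, 2, 0, 0],
          [0, 0, 3, 6],
          [0, 0, 2, 4]]
p_k, p_x, p_y = fit_aspect_model(counts, n_classes=2)

def joint(x, y):
    """Model probability of the dyad (x, y)."""
    return sum(p_k[k] * p_x[k][x] * p_y[k][y] for k in range(len(p_k)))
```

After each M-step the model's marginal over x matches the empirical row frequencies, which gives a simple sanity check on the fit. The latent classes play the role of the grouping structure mentioned in the abstract.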
Similarity-based models of word cooccurrence probabilities
Machine Learning, 1999
Cited by 90 (0 self)
In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a backoff language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against backoff and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.
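The idea of estimating an unseen word combination from "most similar" words can be sketched as a similarity-weighted average of the conditional probabilities observed for a word's neighbours. This is an illustrative sketch, not the paper's exact model. The function name `similarity_estimate`, the neighbour sets, the similarity weights, and the probabilities are all hypothetical, and the weights are taken as given rather than derived from a distributional similarity measure.

```python
def similarity_estimate(w1, w2, cond_prob, similar):
    """P_sim(w2 | w1): average of P(w2 | n) over neighbours n of w1, weighted by similarity."""
    neighbours = similar[w1]
    total_weight = sum(neighbours.values())
    return sum(
        weight / total_weight * cond_prob.get(n, {}).get(w2, 0.0)
        for n, weight in neighbours.items()
    )

# Hypothetical corpus estimates: "eat peach" was never observed,
# but objects of similar verbs were.
cond_prob = {
    "devour": {"peach": 0.2, "book": 0.1},
    "consume": {"peach": 0.1},
}
# Hypothetical similarity weights for the neighbours of "eat".
similar = {"eat": {"devour": 3.0, "consume": 1.0}}

p_peach = similarity_estimate("eat", "peach", cond_prob, similar)
p_beach = similarity_estimate("eat", "beach", cond_prob, similar)
```

Here "peach" receives a nonzero estimate after "eat" because it was seen after similar verbs, while "beach" does not. In a backoff language model, an estimate of this kind would only be consulted for bigrams that were not observed in training.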
A Bit of Progress in Language Modeling
2001
Cited by 87 (2 self)
Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Hull, 1992; Kernighan et al., 1990; Srihari and Baltus, 1992). The most commonly used language models are very simple (e.g. a Katz-smoothed trigram model). There are many improvements over this simple model, however, including caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models, all of which we will describe below. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself. In this...
A survey of smoothing techniques for ME models
IEEE Transactions on Speech and Audio Processing, 2000
Cited by 85 (1 self)
In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. Several smoothing methods for ME models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in ME smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing n-gram language models. Because of the mature body of research in n-gram model smoothing and the close connection between ME and conventional n-gram models, this domain is well-suited to gauge the performance of ME smoothing methods. Over a large number of data sets, we find that fuzzy ME smoothing performs as well as or better than all other algorithms under consideration. We contrast this method with previous n-gram smoothing methods to explain its superior performance.
Index Terms: Exponential models, language modeling, maximum entropy, minimum divergence, n-gram models, smoothing.