Results 1 - 10
of
18
A Maximum Entropy Approach to Adaptive Statistical Language Modeling
- Computer, Speech and Language
, 1996
"... An adaptive statistical languagemodel is described, which successfullyintegrates long distancelinguistic information with other knowledge sources. Most existing statistical language models exploit only the immediate history of a text. To extract information from further back in the document's histor ..."
Abstract
-
Cited by 201 (11 self)
- Add to MetaCart
An adaptive statistical languagemodel is described, which successfullyintegrates long distancelinguistic information with other knowledge sources. Most existing statistical language models exploit only the immediate history of a text. To extract information from further back in the document's history, we propose and use trigger pairs as the basic information bearing elements. This allows the model to adapt its expectations to the topic of discourse. Next, statistical evidence from multiple sources must be combined. Traditionally, linear interpolation and its variants have been used, but these are shown here to be seriously deficient. Instead, we apply the principle of Maximum Entropy (ME). Each information source gives rise to a set of constraints, to be imposed on the combined estimate. The intersection of these constraints is the set of probability functions which are consistent with all the information sources. The function with the highest entropy within that set is the ME solution...
Similarity-Based Estimation of Word Cooccurrence Probabilities
- In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics
, 1994
"... In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the ..."
Abstract
-
Cited by 65 (7 self)
- Add to MetaCart
In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most sim- ilar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
Building Probabilistic Models for Natural Language
, 1996
"... Building models of language is a central task in natural language processing. Traditionally, language has been modeled with manually-constructed grammars that describe which strings are grammatical and which are not; however, with the recent availability of massive amounts of on-line text, statistic ..."
Abstract
-
Cited by 60 (1 self)
- Add to MetaCart
Building models of language is a central task in natural language processing. Traditionally, language has been modeled with manually-constructed grammars that describe which strings are grammatical and which are not; however, with the recent availability of massive amounts of on-line text, statistically-trained models are an attractive alternative. These models are generally probabilistic, yielding a score reflecting sentence frequency instead of a binary grammaticality judgement. Probabilistic models of language are a fundamental tool in speech recognition for resolving acoustically ambiguous utterances. For example, we prefer the transcription forbear to four bear as the former string is far more frequent in English text. Probabilistic models also have application in optical character recognition, handwriting recognition, spelling correction, part-of-speech tagging, and machine translation. In this thesis, we investigate three problems involving the probabilistic modeling of languag...
Similarity-based approaches to natural language processing
, 1997
"... Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing,
A Hybrid Approach To Adaptive Statistical Language Modeling
- Proceedings of the ARPA workshop on human language technology
, 1994
"... We desert'be our latest attempt at adaptive language modeling. At the heart of our approach is a Maximum Entropy (ME) model which inc.orlxnates many knowledge sources in a consistent manner. The other components are a selective unigram cache, a conditional bigram cache, and a conventionalstatic trig ..."
Abstract
-
Cited by 23 (2 self)
- Add to MetaCart
We desert'be our latest attempt at adaptive language modeling. At the heart of our approach is a Maximum Entropy (ME) model which inc.orlxnates many knowledge sources in a consistent manner. The other components are a selective unigram cache, a conditional bigram cache, and a conventionalstatic trigram. We describe the knowledge sources used to build such a model with ARPA's official WSJ corpus, and report on perplexity and word error rate results obtained with it. Then, three different adaptation paradigms are discussed, and an additional experiment, based on AP wire data, is used to compare them. 1. OVERVIEW OF ME FRAMEWORK Using several different probability estimates to arrive at one combined estimate is a general problem that arises in many tasks. The Maximum Entropy (ME) principle has recently been demonstrated as a powerful tool for combining statistical estimates from diverse sources[l, 2, 3]. The ME principle ([4, 5]) proposes the following: 1. Reformulate the different estimates as constraints on the expectation of various functions, to be satisfied by the target (combined) estimate. 2. Among all probability distributions that satisfy these con-straints, choose the one that has the highest entropy. More specifically, for estimating a probability function P(x), each constraint i is associated with a constraintfunctionfi(x) and a desired expectation ci. The constraint is then written as: def E Eefi = P(x)fi(x) = ci. (1) X Given consistent constraints, a unique ME solutions is guar-anteed to exist, and to be of the form: P(x) = II mf'°°, (2) i where the pi's are some unknown constants, to be found. Probability functions of the form (2) are called log-linear, and the family of functions defined by holding thefi's fixed and varying the pi's is called an exponential family.
A Class-Based Approach To Lexical Discovery
- In Proc. of the 30th Ann. Meet. of the Assoc. for Computational Linguistics (ACL-92
, 1992
"... this paper I propose a generalization of lexical association techniques that is intended to facilitate statistical discovery of facts involving word classes rather than individual words. Although defining association measures over classes (as sets of words) is straightforward in theory, making direc ..."
Abstract
-
Cited by 19 (2 self)
- Add to MetaCart
this paper I propose a generalization of lexical association techniques that is intended to facilitate statistical discovery of facts involving word classes rather than individual words. Although defining association measures over classes (as sets of words) is straightforward in theory, making direct use of such a definition is impractical because there are simply too many classes to consider. Rather than considering all possible classes, I propose constraining the set of possible word classes by using a broad-coverage lexical/conceptual hierarchy [Miller, 1990]
2002. Adding intelligent help to mixed-initiative spoken dialogue systems
- In ACL-02 Companion Volume to the Proceedings of the Conference, page 95, Philadelphia. Association for Computational Linguistics
, 2002
"... The rapidly expanding voice recognition industry has so far shown a preference for grammar-based language modelling, despite the better overall performance of statistical language modelling. Given that the advantages of the grammar-based approach make it unlikely to be replaced as the primary soluti ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
The rapidly expanding voice recognition industry has so far shown a preference for grammar-based language modelling, despite the better overall performance of statistical language modelling. Given that the advantages of the grammar-based approach make it unlikely to be replaced as the primary solution in the near future, it is natural to wonder whether some combination of the two approaches may prove useful. Here, we describe an implemented system that uses statistical language modelling and a decision-tree classifier to provide the user with some feedback when grammarbased recognition fails. Users of this system had more successful interactions than did users of a control system. 1.
Category-Based Statistical Language Models
, 1997
"... this document. The first section, in chapter 3, develops a model for syntactic dependencies based on word-category n-grams. The second section, in chapter 4, extends this model by allowing short-range word relations to be captured through the incorporation of selected word n-grams. ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
this document. The first section, in chapter 3, develops a model for syntactic dependencies based on word-category n-grams. The second section, in chapter 4, extends this model by allowing short-range word relations to be captured through the incorporation of selected word n-grams.
Large Vocabulary Continuous Speech Recognition: from Laboratory Systems towards Real-World Applications
, 1996
"... This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
This paper provides an overview of the state-of-the-art in laboratory speaker-independent, large vocabulary continuous speech recognition (LVCSR) systems with a view towards adapting such technology to the requirements of real-world applications. While in speech recognition the principal concern is to transcribe the speech signal as a sequence of words, the same core technology can be applied to domains other than dictation. The main topics addressed are acoustic-phonetic modeling, lexical representation, language modeling, decoding and model adaptation. After a brief summary of experimental results some directions towards usable systems are given. In moving from laboratory systems towards real-world applications, different constraints arise which influence the system design. The application imposes limitations on computational resources, constraints on signal capture, requirements for noise and channel compensation, and rejection capability. The difficulties and costs of adapting existing technology to new languages and application need to be assessed. Near term applications for LVCSR technology are likely to grow in somewhat limited domains such as spoken language systems for information retrieval, and limited domain dictation. Perspectives on some unresolved problems are given, indicating areas for future research

