Results 1–10 of 46
A Maximum Entropy Approach to Adaptive Statistical Language Modeling
 Computer Speech and Language, 1996
Cited by 284 (12 self)
Abstract: An adaptive statistical language model is described, which successfully integrates long-distance linguistic information with other knowledge sources. Most existing statistical language models exploit only the immediate history of a text. To extract information from further back in the document's history, we propose and use trigger pairs as the basic information-bearing elements. This allows the model to adapt its expectations to the topic of discourse. Next, statistical evidence from multiple sources must be combined. Traditionally, linear interpolation and its variants have been used, but these are shown here to be seriously deficient. Instead, we apply the principle of Maximum Entropy (ME). Each information source gives rise to a set of constraints, to be imposed on the combined estimate. The intersection of these constraints is the set of probability functions which are consistent with all the information sources. The function with the highest entropy within that set is the ME solution…
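The ME combination the abstract describes can be illustrated with Generalized Iterative Scaling on a toy problem. This is a minimal sketch, not the paper's model: the vocabulary, the two "knowledge source" constraints, and the target expectations below are all invented for illustration; only the mechanism (finding the highest-entropy distribution consistent with expectation constraints) follows the abstract.

```python
import math

# Toy setup (illustrative, not the paper's): four candidate next words and two
# binary constraints on the combined distribution's feature expectations.
words = ["stock", "bank", "river", "the"]
features = [
    lambda w: 1.0 if w in ("stock", "bank") else 0.0,   # "finance" source fired
    lambda w: 1.0 if w in ("bank", "river") else 0.0,   # "geography" source fired
]
targets = [0.5, 0.3]  # desired expectations supplied by each information source

# Generalized Iterative Scaling: append a slack feature so every outcome has
# the same total feature count C, then scale weights toward the targets.
C = 1 + len(features)
def slack(w):
    return C - sum(f(w) for f in features)
feats = features + [slack]
tgts = targets + [C - sum(targets)]

lam = [0.0] * len(feats)
for _ in range(500):
    # Current model p(w) proportional to exp(sum_i lambda_i * f_i(w))
    Z = sum(math.exp(sum(l * f(w) for l, f in zip(lam, feats))) for w in words)
    p = {w: math.exp(sum(l * f(w) for l, f in zip(lam, feats))) / Z for w in words}
    for i, f in enumerate(feats):
        expected = sum(p[w] * f(w) for w in words)
        lam[i] += math.log(tgts[i] / expected) / C

# p now approximates the maximum-entropy distribution meeting both constraints
```

After convergence, p satisfies both expectation constraints simultaneously, which linear interpolation of the two sources generally cannot guarantee.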
Two decades of statistical language modeling: Where do we go from here?
 Proceedings of the IEEE, 2000
Cited by 194 (1 self)
Abstract: Statistical Language Models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data. Statistical language modeling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications. By and large, statistical language modeling amounts to estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents. Statistical language modeling is crucial for a large variety of language technology applications. These include speech recognition (where SLM got its start), machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, and many more. In machine translation, for example, purely statistical approaches have been introduced in [1]. But even researchers using rule-based approaches have found it beneficial to introduce some elements of SLM and statistical estimation [2]. In information retrieval, a language modeling approach was recently proposed by [3], and a statistical/information-theoretical approach was developed by [4]. SLM employs statistical estimation techniques using language training data, that is, text. Because of the categorical nature of language, and the large vocabularies people naturally use, statistical techniques must estimate a large number of parameters, and consequently depend critically on the availability of large amounts of training data.
A Maximum Entropy Model for Prepositional Phrase Attachment
 In Proceedings of the ARPA Workshop on Human Language Technology, 1994
Cited by 138 (3 self)
Abstract: We present in this paper methods for constructing statistical models for computing the probability of attachment decisions. These models could then be integrated into scoring the probability of an overall parse. We present our methods in the context of prepositional phrase (PP) attachment.
Improving Trigram Language Modeling with The World Wide Web
 Proceedings of ICASSP'01 (Acoustics, Speech, and Signal Processing), 2001
Cited by 54 (0 self)
Abstract: We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models improve speech recognition word error rate significantly over a small test set.
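The count-ratio and interpolation steps the abstract outlines can be sketched in a few lines. The page counts below are made-up stand-ins for real search-engine hit counts, and the interpolation weight is arbitrary; only the arithmetic (trigram count over bigram count, then a linear mix with a corpus estimate) follows the described method.

```python
# Hypothetical page counts standing in for search-engine phrase-query hits
# (the paper queries real engines; these numbers are invented for illustration).
page_counts = {
    "language model is": 1200,
    "language model": 45000,
}

def web_trigram_prob(w1, w2, w3, counts):
    """Estimate P(w3 | w1 w2) as trigram hit count over bigram hit count."""
    tri = counts.get(f"{w1} {w2} {w3}", 0)
    bi = counts.get(f"{w1} {w2}", 0)
    return tri / bi if bi else 0.0

def interpolated_prob(p_corpus, p_web, lam=0.7):
    """Linear interpolation of corpus-based and web-based trigram estimates."""
    return lam * p_corpus + (1 - lam) * p_web

p_web = web_trigram_prob("language", "model", "is", page_counts)
p = interpolated_prob(p_corpus=0.02, p_web=p_web)
```

In practice the interpolation weight would be tuned on held-out data rather than fixed.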
Domain Adaptation with Multiple Sources
Cited by 52 (4 self)
Abstract: This paper presents a theoretical analysis of the problem of domain adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most ǫ are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution-weighted combining rule that has a loss of at most ǫ with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3ǫ. Finally, we report empirical results for a multiple-source adaptation problem with a real-world dataset.
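The distribution-weighted combining rule the abstract contrasts with plain convex combinations can be sketched as follows. Here each hypothesis h_i is weighted at point x by λ_i·D_i(x), normalized over sources; the hypotheses, densities, and weights below are toy stand-ins, not the paper's experimental setup.

```python
def distribution_weighted(hypotheses, densities, weights, x):
    """Distribution-weighted combination: h(x) =
    sum_i (lambda_i * D_i(x) / sum_j lambda_j * D_j(x)) * h_i(x).
    A sketch of the rule analyzed in the paper; inputs here are toy examples."""
    num = sum(l * d(x) * h(x) for h, d, l in zip(hypotheses, densities, weights))
    den = sum(l * d(x) for d, l in zip(densities, weights))
    return num / den if den else 0.0

# Two toy sources: the first assigns label 1, the second label 0, with
# invented source densities; at x the first source dominates 0.8 to 0.2.
h1, h2 = (lambda x: 1.0), (lambda x: 0.0)
d1, d2 = (lambda x: 0.8), (lambda x: 0.2)
y = distribution_weighted([h1, h2], [d1, d2], [0.5, 0.5], 0.0)
```

Note the weighting depends on x through the source densities, which is exactly what a fixed convex combination lacks.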
Highlights: Language- and domain-independent automatic indexing terms for abstracting
 Journal of the American Society for Information Science, 1995
Cited by 50 (0 self)
Abstract: A method of drawing index terms from text is presented. The approach uses no stop list, stemmer, or other language- and domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, which the author calls “highlights,” are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Some experimental results are presented, showing operation in English, Spanish, German, Georgian, Russian, and Japanese.
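The stemmer-free idea can be illustrated with character n-gram counts: words sharing frequent n-grams with the document (relative to a background corpus) score highly, with no language-specific machinery. This is a simplified sketch of the n-gram approach; the scoring formula below is invented for illustration and is not the paper's.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def highlight_scores(doc, background, n=4):
    """Score each word of `doc` by how frequent its character n-grams are in
    the document relative to a background corpus. No stop list or stemmer;
    the ratio-based score here is an illustrative stand-in."""
    doc_counts = Counter(char_ngrams(doc.lower(), n))
    bg_counts = Counter(char_ngrams(background.lower(), n))
    scores = {}
    for word in set(doc.lower().split()):
        grams = char_ngrams(word, n)
        if grams:  # words shorter than n contribute no n-grams
            scores[word] = sum(
                doc_counts[g] / (1 + bg_counts[g]) for g in grams
            ) / len(grams)
    return scores

scores = highlight_scores(
    "statistical language modeling models language statistically",
    "the quick brown fox jumps over the lazy dog",
)
```

Because scoring works on shared character n-grams, morphological variants ("modeling", "models") reinforce each other much as a stemmer would conflate them.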
Adaptive language modeling using the maximum entropy principle
 Human Language Technology: Proceedings of a Workshop Held at Plainsboro, 1993
Cited by 42 (5 self)
Abstract: We describe our ongoing efforts at adaptive statistical language modeling. Central to our approach is the Maximum Entropy (ME) Principle, allowing us to combine evidence from multiple sources, such as long-distance triggers and conventional short-distance trigrams. Given consistent statistical evidence, a unique ME solution is guaranteed to exist, and an iterative algorithm exists which is guaranteed to converge to it. Among the advantages of this approach are its simplicity, its generality, and its incremental nature. Among its disadvantages are its computational requirements. We describe a succession of ME models, culminating in our current Maximum Likelihood / Maximum Entropy (ML/ME) model. Preliminary results with the latter show a 27% perplexity reduction as compared to a conventional trigram model.
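The long-distance trigger features mentioned above have a simple shape: a feature fires when a trigger word occurred anywhere in the document history and a particular target word is the candidate next word. The sketch below shows such a feature and the unnormalized ME weight it contributes; the word pairs and λ values are invented for illustration.

```python
import math

def trigger_feature(trigger, target):
    """Trigger-pair feature: fires when `trigger` appeared anywhere in the
    document history and the next word is `target` (names are illustrative)."""
    def f(history, word):
        return 1.0 if trigger in history and word == target else 0.0
    return f

def me_weight(history, word, features, lambdas):
    """Unnormalized ME weight exp(sum_i lambda_i * f_i(history, word));
    dividing by the sum over the vocabulary would give P(word | history)."""
    return math.exp(sum(l * f(history, word) for f, l in zip(features, lambdas)))

feats = [trigger_feature("stock", "bank"), trigger_feature("river", "bank")]
w = me_weight(["the", "stock", "fell"], "bank", feats, [0.9, 0.7])
```

Only the "stock"→"bank" trigger fires on this history, so the weight is exp(0.9); a conventional trigram feature would simply ignore everything before the last two words.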
Domain Adaptation: Learning Bounds and Algorithms
Cited by 41 (7 self)
Abstract: This paper addresses the general problem of domain adaptation which arises in a variety of applications where the distribution of the labeled sample available somewhat differs from that of the test data. Building on previous work by Ben-David et al. (2007), we introduce a novel distance between distributions, discrepancy distance, that is tailored to adaptation problems with arbitrary loss functions. We give Rademacher complexity bounds for estimating the discrepancy distance from finite samples for different loss functions. Using this distance, we derive new generalization bounds for domain adaptation for a wide family of loss functions. We also present a series of novel adaptation bounds for large classes of regularization-based algorithms, including support vector machines and kernel ridge regression, based on the empirical discrepancy. This motivates our analysis of the problem of minimizing the empirical discrepancy for various loss functions, for which we also give several algorithms. We report the results of preliminary experiments that demonstrate the benefits of our discrepancy minimization algorithms for domain adaptation.
On adaptive decision rules and decision parameter adaptation for automatic speech recognition
 Proc. IEEE, 2000
Cited by 32 (4 self)
Abstract: Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker- and environment-specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine…
A Whole Sentence Maximum Entropy Language Model
 Proceedings of the IEEE Workshop on Speech Recognition and Understanding, 1997
Cited by 28 (6 self)
Abstract: We introduce a new kind of language model, which models whole sentences or utterances directly using the Maximum Entropy paradigm. The new model is conceptually simpler, and more naturally suited to modeling whole-sentence phenomena, than the conditional ME models proposed to date. By avoiding the chain rule, the model treats each sentence or utterance as a "bag of features", where features are arbitrary computable properties of the sentence. The model is unnormalizable, but this does not interfere with training (done via sampling) or with use. Using the model is computationally straightforward. The main computational cost of training the model is in generating sample sentences from a Gibbs distribution. Interestingly, this cost has different dependencies, and is potentially lower, than in the comparable conditional ME model. Conventional statistical language models estimate the probability of a sentence s by using the chain rule to decompose it into a product of conditional…
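The "bag of features" scoring the abstract describes works in log space as log p0(s) plus a weighted sum of arbitrary sentence features, with no normalization over all sentences. The sketch below uses an invented baseline log-probability and two toy features; only the functional form follows the abstract.

```python
def whole_sentence_score(sentence, baseline_logprob, features, lambdas):
    """Unnormalized whole-sentence ME score in log space:
    log p(s) = log p0(s) + sum_i lambda_i * f_i(s) (up to a constant).
    `features` are arbitrary computable sentence properties; the baseline
    log-probability would come from e.g. a trigram model (here made up)."""
    return baseline_logprob + sum(
        l * f(sentence) for f, l in zip(features, lambdas)
    )

features = [
    lambda s: float(len(s.split())),            # sentence length in words
    lambda s: 1.0 if s.endswith("?") else 0.0,  # looks like a question
]
score = whole_sentence_score("is this a question ?", -12.0, features, [0.1, 0.5])
```

Because scores are only compared between candidate sentences (e.g. when rescoring recognizer hypotheses), the missing normalization constant cancels and never needs to be computed.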