Results 1 - 10
of
28
An Empirical Study of Smoothing Techniques for Language Modeling
, 1998
"... We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Br ..."
Abstract
-
Cited by 631 (19 self)
- Add to MetaCart
We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods. 1
A Gaussian Prior for Smoothing Maximum Entropy Models
, 1999
"... In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood training for exponential models, and like other maximum likelihood methods is prone to overfitting of training data. Several smoothing methods for maximum entropy models have been proposed to address this problem, ..."
Abstract
-
Cited by 181 (1 self)
- Add to MetaCart
In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood training for exponential models, and like other maximum likelihood methods is prone to overfitting of training data. Several smoothing methods for maximum entropy models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in maximum entropy smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing n-gram language models. Because of the mature body of research in n-gram model smoothing and the close connection between maximum entropy and conventional n-gram models, this domain is well-suited to gauge the performance of maximum entropy smoothing methods. Over a large number of data sets, we find that an ME smoothing method proposed to us by Lafferty [1] performs as well as or better tha...
The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
- Machine Learning
, 1996
"... . We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions gene ..."
Abstract
-
Cited by 148 (15 self)
- Add to MetaCart
. We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs, can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
A survey of smoothing techniques for ME models
- IEEE Transactions on Speech and Audio Processing
, 2000
"... Abstract—In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. Several smoothing methods for ME models have been proposed to address this problem, but previous r ..."
Abstract
-
Cited by 75 (1 self)
- Add to MetaCart
Abstract—In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. Several smoothing methods for ME models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in ME smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing-gram language models. Because of the mature body of research in-gram model smoothing and the close connection between ME and conventional-gram models, this domain is well-suited to gauge the performance of ME smoothing methods. Over a large number of data sets, we find that fuzzy ME smoothing performs as well as or better than all other algorithms under consideration. We contrast this method with previous-gram smoothing methods to explain its superior performance. Index Terms—Exponential models, language modeling, maximum entropy, minimum divergence,-gram models, smoothing.
The Power of Amnesia
- Machine Learning
, 1994
"... We propose a learning algorithm for a variable memory length Markov process. Human communication, whether given as text, handwriting, or speech, has multi characteristic time scales. On short scales it is characterized mostly by the dynamics that generate the process, whereas on large scales, more s ..."
Abstract
-
Cited by 69 (4 self)
- Add to MetaCart
We propose a learning algorithm for a variable memory length Markov process. Human communication, whether given as text, handwriting, or speech, has multi characteristic time scales. On short scales it is characterized mostly by the dynamics that generate the process, whereas on large scales, more syntactic and semantic information is carried. For that reason the conventionally used fixed memory Markov models cannot capture effectively the complexity of such structures. On the other hand using long memory models uniformly is not practical even for as short memory as four. The algorithm we propose is based on minimizing the statistical prediction error by extending the memory, or state length, adaptively, until the total prediction error is sufficiently small. We demonstrate the algorithm by learning the structure of natural English text and applying the learned model to the correction of corrupted text. Using less than 3000 states the model's performance is far superior to that of fixe...
A Hierarchical Dirichlet Language Model
- Natural Language Engineering
, 1994
"... We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as `smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions ..."
Abstract
-
Cited by 67 (3 self)
- Add to MetaCart
We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as `smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions for language modelling. The ideas of this paper are also applicable to other problems such as the modelling of triphomes in speech, and DNA and protein sequences in molecular biology. The new algorithm is compared with smoothing on a two million word corpus. The methods prove to be about equally accurate, with the hierarchical model using fewer computational resources. Contents 1 Introduction 2 1.1 The bigram language model with smoothing 2 1.2 Any rational predictive procedure can be made Bayesian 3 2 An explicit model using Dirichlet priors 4 2.1 The inferences we will make 4 2.2 The likelihood function 5 2.3 What prior? 5 2.4 A convenient family of priors: Dirichlet distributions 5 2.5 ...
Building Probabilistic Models for Natural Language
, 1996
"... Building models of language is a central task in natural language processing. Traditionally, language has been modeled with manually-constructed grammars that describe which strings are grammatical and which are not; however, with the recent availability of massive amounts of on-line text, statistic ..."
Abstract
-
Cited by 60 (1 self)
- Add to MetaCart
Building models of language is a central task in natural language processing. Traditionally, language has been modeled with manually-constructed grammars that describe which strings are grammatical and which are not; however, with the recent availability of massive amounts of on-line text, statistically-trained models are an attractive alternative. These models are generally probabilistic, yielding a score reflecting sentence frequency instead of a binary grammaticality judgement. Probabilistic models of language are a fundamental tool in speech recognition for resolving acoustically ambiguous utterances. For example, we prefer the transcription forbear to four bear as the former string is far more frequent in English text. Probabilistic models also have application in optical character recognition, handwriting recognition, spelling correction, part-of-speech tagging, and machine translation. In this thesis, we investigate three problems involving the probabilistic modeling of languag...
Mixed memory Markov models: decomposing complex stochastic processes as mixtures of simpler ones
, 1998
"... . We study Markov models whose state spaces arise from the Cartesian product of two or more discrete random variables. We show how to parameterize the transition matrices of these models as a convex combination---or mixture---of simpler dynamical models. The parameters in these models admit a simple ..."
Abstract
-
Cited by 52 (1 self)
- Add to MetaCart
. We study Markov models whose state spaces arise from the Cartesian product of two or more discrete random variables. We show how to parameterize the transition matrices of these models as a convex combination---or mixture---of simpler dynamical models. The parameters in these models admit a simple probabilistic interpretation and can be fitted iteratively by an Expectation-Maximization (EM) procedure. We derive a set of generalized Baum-Welch updates for factorial hidden Markov models that make use of this parameterization. We also describe a simple iterative procedure for approximately computing the statistics of the hidden states. Throughout, we give examples where mixed memory models provide a useful representation of complex stochastic processes. Keywords: Markov models, mixture models, discrete time series 1. Introduction The modeling of time series is a fundamental problem in machine learning, with widespread applications. These include speech recognition (Rabiner, 1989), natu...
A hierarchical Bayesian language model based on Pitman–Yor processes
- In Coling/ACL, 2006. 9
, 2006
"... We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes which produce power-law distributions more closely resembling those in natural languages. We show that an approxi ..."
Abstract
-
Cited by 48 (6 self)
- Add to MetaCart
We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes which produce power-law distributions more closely resembling those in natural languages. We show that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing methods for n-gram language models. Experiments verify that our model gives cross entropy results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney. 1
A Variable-Length Category-Based N-Gram Language Model
- IN PROCEEDINGS, IEEE ICASSP
, 1996
"... A language model based on word-category n-grams and ambiguous category membership with n increased selectively to trade compactness for performance is presented. The use of categories leads intrinsically to a compact model with the ability to generalise to unseen word sequences, and diminishes the s ..."
Abstract
-
Cited by 42 (6 self)
- Add to MetaCart
A language model based on word-category n-grams and ambiguous category membership with n increased selectively to trade compactness for performance is presented. The use of categories leads intrinsically to a compact model with the ability to generalise to unseen word sequences, and diminishes the spareseness of the training data, thereby making larger n feasible. The language model implicitly involves a statistical tagging operation, which may be used explicitly to assign category assigments to untagged text. Experiments on the LOB corpus show the optimal model-building strategy to yield improved results with respect to conventional n-gram methods, and when used as a tagger, the model is seen to perform well in relation to a standard benchmark.

