## Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models (1999)

Citations: 21 (5 self)

### BibTeX

```bibtex
@MISC{Chen99efficientsampling,
  author = {Stanley F. Chen and Ronald Rosenfeld},
  title = {Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models},
  year = {1999}
}
```

### Abstract

Conditional Maximum Entropy models have been successfully applied to estimating language model probabilities of the form P(w|h), but are often too demanding computationally. Furthermore, the conditional framework does not lend itself to expressing global sentential phenomena. We have recently introduced a non-conditional Maximum Entropy language model which directly models the probability of an entire sentence or utterance. The model treats each utterance as a "bag of features," where features are arbitrary computable properties of the sentence. Using the model is computationally straightforward, since it does not require normalization. Training the model requires efficient sampling of sentences from an exponential distribution.
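
Concretely, a whole-sentence ME model takes the form P(s) = (1/Z) · p0(s) · exp(Σᵢ λᵢ fᵢ(s)), where p0 is a prior model and the fᵢ are arbitrary sentence features; when ranking hypotheses the normalizer Z cancels, which is why using the model is cheap. A minimal sketch, with hypothetical feature functions and weights chosen purely for illustration:

```python
import math

def unnormalized_score(sentence, prior_prob, features, weights):
    """p0(s) * exp(sum_i lambda_i * f_i(s)).

    The normalizer Z is never needed when only comparing or
    ranking candidate sentences."""
    return prior_prob * math.exp(
        sum(w * f(sentence) for f, w in zip(features, weights))
    )

# Illustrative features: sentence length and a substring count.
features = [
    lambda s: float(len(s.split())),
    lambda s: float(s.count("of the")),
]
weights = [-0.01, 0.5]  # hypothetical lambda_i values

score = unnormalized_score("the cat sat of the mat", 1e-6, features, weights)
```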

### Citations

4075 | Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
- Geman, Geman
- 1987
Citation Context: ...l sampling methods for estimating the values E_P[f_i], and present results evaluating their relative efficacy. Della Pietra et al. [5] build a joint ME model of word spelling. They use Gibbs sampling [8] to generate a set of word spellings {s_1, ..., s_M}, and estimate E_P[f_i] ≈ (1/M) Σ_{j=1}^{M} f_i(s_j). Gibbs sampling is not efficient for sentence models, as the probability of a great many sentence...
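
The Monte Carlo estimate described in this context, E_P[f_i] ≈ (1/M) Σⱼ f_i(s_j), can be sketched as follows; the toy "samples" below are illustrative, not the paper's data:

```python
import random

def monte_carlo_feature_expectation(samples, feature):
    """Estimate E_P[f] as (1/M) * sum_j f(s_j), where s_1..s_M are
    drawn from P by some sampler (e.g. Gibbs or Metropolis)."""
    return sum(feature(s) for s in samples) / len(samples)

# Toy illustration: synthetic "sentences" (token lists) whose
# lengths are uniform on 1..5, so E[length] should be near 3.
random.seed(0)
samples = [["a"] * random.randint(1, 5) for _ in range(1000)]
avg_len = monte_carlo_feature_expectation(samples, len)
```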

2546 | Equation of State Calculations by Fast Computing Machines
- Metropolis, Rosenbluth, et al.
- 1953
Citation Context: ...imate E_P[f_i] ≈ (1/M) Σ_{j=1}^{M} f_i(s_j). Gibbs sampling is not efficient for sentence models, as the probability of a great many sentences must be computed to generate each sample. Metropolis sampling [12] is more appropriate for this situation. An initial sentence is chosen randomly. For each word position in turn, a new word is proposed to replace the original word in that position, and this change ...
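
The word-by-word Metropolis scheme described here can be sketched as follows. The vocabulary, feature, and weight are hypothetical, and `score` stands in for the unnormalized sentence probability p0(s) · exp(Σᵢ λᵢ fᵢ(s)):

```python
import math
import random

def metropolis_sentence_sampler(init, vocab, score, sweeps, rng):
    """Word-wise Metropolis sampler sketch: for each position, propose
    replacing the word with one drawn uniformly from the vocabulary,
    and accept with probability min(1, P(s')/P(s)). Only unnormalized
    scores are needed, since the normalizer cancels in the ratio."""
    s = list(init)
    for _ in range(sweeps):
        for pos in range(len(s)):
            proposal = list(s)
            proposal[pos] = rng.choice(vocab)
            ratio = score(proposal) / score(s)
            if rng.random() < min(1.0, ratio):
                s = proposal
    return s

# Toy score favoring the word "good" (hypothetical feature/weight).
score = lambda s: math.exp(2.0 * s.count("good"))
rng = random.Random(0)
sample = metropolis_sentence_sampler(["bad"] * 5, ["good", "bad"], score, 50, rng)
```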

1163 | A Maximum Entropy Approach to Natural Language Processing
- Berger
- 1996
Citation Context: ...pounded. 1.2. Conditional Maximum Entropy Models In the last few years, Maximum Entropy (ME, [9]) models have been successfully used to estimate conditional language probabilities of the form P(w|h) [6, 11, 1, 15] (as well as to model prepositional phrase attachment [14] and induce features of word spelling [5]). In using Maximum Entropy to model P(w|h), one major obstacle is the heavy computational requireme...

757 | Information theory and statistical mechanics
- Jaynes
- 1957
Citation Context: ...f every word in the current sentence causes small but systematic biases in probability estimation to be compounded. 1.2. Conditional Maximum Entropy Models In the last few years, Maximum Entropy (ME, [9]) models have been successfully used to estimate conditional language probabilities of the form P(w|h) [6, 11, 1, 15] (as well as to model prepositional phrase attachment [14] and induce features of ...

748 | Class-based N-gram models of natural language
- 1992
Citation Context: ...othing [10], and used it as our prior p0(s). We employed features that constrained the frequency of word n-grams (up to n=4), distance-two word n-grams (up to n=3) [15], and class n-grams (up to n=5) [3]. That is, we considered features of the form f_α(s) = (# of times n-gram α occurs in s). We partitioned our vocabulary (of 15,000 words) into 100, 300, and 1000 classes using the word classing algo...
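
N-gram frequency features of the form f_α(s) = (# of times α occurs in s) can be sketched as simple counting functions; the example n-gram below is arbitrary:

```python
def ngram_feature(ngram):
    """Return f_alpha(s): the number of times the word n-gram alpha
    occurs in sentence s, where s is a list of tokens."""
    n = len(ngram)
    def f(sentence):
        # Slide a window of width n over the sentence and count matches.
        return sum(
            tuple(sentence[i:i + n]) == ngram
            for i in range(len(sentence) - n + 1)
        )
    return f

f = ngram_feature(("of", "the"))
count = f("first of the month and second of the month".split())  # → 2
```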

581 | Inducing features of random fields
- 1997
Citation Context: ...been successfully used to estimate conditional language probabilities of the form P(w|h) [6, 11, 1, 15] (as well as to model prepositional phrase attachment [14] and induce features of word spelling [5]). In using Maximum Entropy to model P(w|h), one major obstacle is the heavy computational requirements of training and using the model. These requirements are particularly severe because of the need...

452 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context: ...ihood solution of the exponential family. For more information, see [9, 1, 15]. The MDI or ME solution can be found by an iterative procedure such as the Generalized Iterative Scaling (GIS) algorithm [4]. GIS starts with arbitrary λ_i's. At each iteration, the algorithm improves the {λ_i} values by comparing the expectation of each feature under the current P to the target value, and modifying the ass...
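
A minimal sketch of the classical GIS update, λ_i ← λ_i + (1/C) · log(E_~p[f_i] / E_P[f_i]), on a tiny finite outcome space. It assumes, as GIS requires, that the features sum to the same constant C on every outcome (here a complementary "slack" feature enforces C = 1); the outcomes and target expectations are invented for illustration:

```python
import math

def gis(outcomes, features, target, iters=200):
    """Generalized Iterative Scaling sketch over a finite outcome space.
    Requires sum_i f_i(x) == C for every x; target[i] is the empirical
    expectation E_~p[f_i] the model must match."""
    C = sum(f(outcomes[0]) for f in features)
    lam = [0.0] * len(features)
    for _ in range(iters):
        # Unnormalized model weights exp(sum_i lambda_i f_i(x)).
        w = [math.exp(sum(l * f(x) for l, f in zip(lam, features)))
             for x in outcomes]
        Z = sum(w)
        # Expectation of each feature under the current model P.
        expect = [sum(w[k] / Z * f(x) for k, x in enumerate(outcomes))
                  for f in features]
        # Multiplicative GIS correction toward the target expectations.
        lam = [l + math.log(t / e) / C
               for l, t, e in zip(lam, target, expect)]
    return lam

outcomes = [0, 1, 2, 3]
f1 = lambda x: 1.0 if x < 2 else 0.0
f2 = lambda x: 1.0 - f1(x)   # slack feature so f1(x) + f2(x) == C == 1
lam = gis(outcomes, [f1, f2], target=[0.7, 0.3])
```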

309 | Improved Backing-Off for M-gram Language Modeling
- Kneser, Ney
- 1995
Citation Context: ..., using Switchboard as our example domain. Our training data consisted of three million words of Switchboard text. We constructed a trigram model on this data using a variation of Kneser-Ney smoothing [10], and used it as our prior p0(s). We employed features that constrained the frequency of word n-grams (up to n=4), distance-two word n-grams (up to n=3) [15], and class n-grams (up to n=5) [3]. That i...

258 | A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language
- Rosenfeld
- 1996
Citation Context: ...pounded. 1.2. Conditional Maximum Entropy Models In the last few years, Maximum Entropy (ME, [9]) models have been successfully used to estimate conditional language probabilities of the form P(w|h) [6, 11, 1, 15] (as well as to model prepositional phrase attachment [14] and induce features of word spelling [5]). In using Maximum Entropy to model P(w|h), one major obstacle is the heavy computational requireme...

184 | On structuring probabilistic dependency in stochastic language modeling, Computer Speech & Language 8
- Ney
- 1994
Citation Context: ...ered features of the form f_α(s) = (# of times n-gram α occurs in s). We partitioned our vocabulary (of 15,000 words) into 100, 300, and 1000 classes using the word classing algorithm of Ney et al. [3, 13] on our training data. To select specific features we devised the following procedure. First, we generated an artificial corpus by sampling from our prior trigram distribution p0(s). This "trigram cor...

88 | Trigger-based language models: A maximum entropy approach
- Lau, Rosenfeld, et al.
- 1993
Citation Context: ...pounded. 1.2. Conditional Maximum Entropy Models In the last few years, Maximum Entropy (ME, [9]) models have been successfully used to estimate conditional language probabilities of the form P(w|h) [6, 11, 1, 15] (as well as to model prepositional phrase attachment [14] and induce features of word spelling [5]). In using Maximum Entropy to model P(w|h), one major obstacle is the heavy computational requireme...

44 | Adaptive language modelling using minimum discriminant estimation
- Pietra, Pietra, et al.
- 1992

28 | Just-in-time language modelling
- Berger, Miller
- 1998
Citation Context: ...in all of our experiments. 3.3. Smoothing From equation (5) we can see that if E_~p[f_i] = 0 then we will have λ_i → −∞. To smooth our model, we take the approach described by Berger and Miller [2]: we introduce a Gaussian prior on the λ_i values and search for the maximum a posteriori model instead of the maximum likelihood model. 4. FEATURE SELECTION In this section, we discuss feature selection and...
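
The effect of the Gaussian prior can be illustrated with a simple gradient-ascent sketch (not the authors' exact training procedure): with a N(0, σ²) prior on each λ_i, the log-posterior gradient is E_~p[f_i] − E_P[f_i] − λ_i/σ², so even a feature whose empirical expectation is 0 receives a finite weight instead of drifting to −∞. The outcome space and feature below are invented for illustration:

```python
import math

def map_weights(outcomes, features, target, sigma2=1.0, lr=0.5, steps=500):
    """MAP training sketch via gradient ascent, with a Gaussian prior
    N(0, sigma2) on each lambda_i. The gradient of the log-posterior
    in lambda_i is: target_i - E_P[f_i] - lambda_i / sigma2."""
    lam = [0.0] * len(features)
    for _ in range(steps):
        w = [math.exp(sum(l * f(x) for l, f in zip(lam, features)))
             for x in outcomes]
        Z = sum(w)
        expect = [sum(w[k] / Z * f(x) for k, x in enumerate(outcomes))
                  for f in features]
        lam = [l + lr * (t - e - l / sigma2)
               for l, t, e in zip(lam, target, expect)]
    return lam

outcomes = [0, 1, 2, 3]
f1 = lambda x: 1.0 if x == 0 else 0.0
# Empirical expectation 0: pure ML would drive lambda_1 to -infinity,
# but the prior keeps it at a finite negative value.
lam = map_weights(outcomes, [f1], target=[0.0])
```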

28 | The JanusRTk Switchboard/Callhome 1997 Evaluation System: Pronunciation Modeling
- Finke
- 1997
Citation Context: ...Table 3: Top-1 WER and average rank of best hypothesis using varying feature sets. ...Janus system [7] on a Switchboard/CallHome test set of 8,300 words. The trigram p0(s) served as baseline. For each model, we computed both the top-1 word error rate and the average rank of the least errorful hypothes...

28 | A Whole Sentence Maximum Entropy Language Model
- Rosenfeld
- 1997
Citation Context: ...ch new value of h. 1.3. Whole Sentence Maximum Entropy Models We have recently introduced a new Maximum Entropy language model which directly models the probability of an entire sentence or utterance [16]. The new model is conceptually simpler, as well as more naturally suited to modeling whole-sentence phenomena, than the conditional ME models proposed earlier. By avoiding the chain rule, the model t...

1 | A maximum entropy approach for prepositional phrase attachment
- Ratnaparkhi, Reynar, et al.
- 1994
Citation Context: ...rs, Maximum Entropy (ME, [9]) models have been successfully used to estimate conditional language probabilities of the form P(w|h) [6, 11, 1, 15] (as well as to model prepositional phrase attachment [14] and induce features of word spelling [5]). In using Maximum Entropy to model P(w|h), one major obstacle is the heavy computational requirements of training and using the model. These requirements ar...