## Exponential Priors for Maximum Entropy Models (2003)


Venue: In Proceedings of the Annual Meeting of the Association for Computational Linguistics

Citations: 58 (0 self)

### BibTeX

@INPROCEEDINGS{Goodman03exponentialpriors,

author = {Joshua Goodman},

title = {Exponential Priors for Maximum Entropy Models},

booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics},

year = {2003},

pages = {305--312}

}


### Abstract

this paper. Finally, thanks to Stan Chen and Roni Rosenfeld: our derivation for Exponential priors closely follows the text of their derivation for Gaussian priors.

### Citations

2024 | Regression shrinkage and selection via the LASSO - Tibshirani - 1996

1140 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context ...models have been widely used for a variety of tasks, including language modeling [17], part-of-speech tagging, prepositional phrase attachment, and parsing [15], word selection for machine translation [2], and finding sentence boundaries [16]. They are also sometimes called logistic regression models, maximum likelihood exponential models, log-linear models, and are even equivalent to a form of percep...

924 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1996
Citation Context ...r (Equation 1), since those constraints change as the values of λ_i change. Furthermore, as we will describe in Section 3, discounting by a constant is a common technique for language model smoothing [4], but one that has not previously been well justified; the Exponential prior gives some Bayesian justification. In Section 5 we will show that on two very different tasks -- grammar checking and a coll...

573 | Inducing features of random fields
- Pietra, Stephen, et al.
- 1997
Citation Context ...nancial or river; f_i(x, y) could be 1 if the context includes the word "money" and y is the financial sense; and λ_i would be a large positive number. Maxent models have several valuable properties [7]. The most important is constraint satisfaction. For a given f_i, we can count how many times f_i was observed in the training data with value y, observed[i] = Σ_j f_i(x_j, y_j). For a model P_λ w...
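The empirical count observed[i] = Σ_j f_i(x_j, y_j) described in this context can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the feature function and toy data below are invented assumptions that mirror the bank-disambiguation example.

```python
# Sketch (not from the paper): computing observed[i] = sum_j f_i(x_j, y_j),
# the empirical feature count that a maxent model's expected count must match.

def observed_counts(features, data):
    """features: list of indicator functions f_i(x, y) -> 0/1;
    data: list of (x, y) training pairs."""
    return [sum(f(x, y) for x, y in data) for f in features]

# Hypothetical indicator: fires when the context contains "money" and the
# predicted sense y is "financial", echoing the example in the context above.
f_money_financial = lambda x, y: 1 if "money" in x and y == "financial" else 0

data = [({"money", "bank"}, "financial"), ({"bank", "water"}, "river")]
counts = observed_counts([f_money_financial], data)  # -> [1]
```

A trained model P_λ is then required to make its expected count for each f_i match this observed count.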

444 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context ...uccessive search directions and the initially preferred value of the step size (page 130). 2 We show that an incredibly simple variation on a standard algorithm, Generalized Iterative Scaling (GIS) [5], solves this problem. In particular, as we will show, while GIS uses an update rule of the form λ_i := λ_i + (1/f#) log(observed[i]/expected[i]), our modified algorithm uses a rule of the form λ_i := ma...
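The standard GIS update quoted in this context can be sketched directly. This is a minimal sketch under stated assumptions: f# is taken to be the maximum total feature count per event (the usual GIS step-size constant), and the paper's modified, exponential-prior rule is truncated above, so it is not reproduced here.

```python
import math

# Sketch of the standard GIS update quoted above:
#   lambda_i := lambda_i + (1/f#) * log(observed[i] / expected[i])
# f_sharp (f#) is assumed to be the max total feature count per event.

def gis_step(lambdas, observed, expected, f_sharp):
    return [lam + (1.0 / f_sharp) * math.log(obs / exp)
            for lam, obs, exp in zip(lambdas, observed, expected)]

# Toy numbers: the first weight grows because the model undershoots its
# observed count; the second is already satisfied, so it is unchanged.
lams = gis_step([0.0, 0.5], observed=[4.0, 2.0], expected=[2.0, 2.0],
                f_sharp=2.0)
```

Each weight moves until expected[i] matches observed[i], at which point the log ratio is zero and the update is a fixed point.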

393 | The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context ...ow discrimination features may cause overfitting, they do contain valuable information. Because of this, more recent approaches [17, page 38], [13] have tried techniques such as Good-Turing discounts [8]. There are a number of other approaches [11, 14] based on the fuzzy maxent framework [6]. Chen and Rosenfeld [3] give a more complete discussion of those approaches. Chen and Rosenfeld [3], following...

296 | Improved backing-off for m-gram language modeling - Kneser, Ney - 1995

211 | Maximum Entropy Models for Natural Language Ambiguity Resolution
- Ratnaparkhi
- 1998
Citation Context ...uction Conditional Maximum Entropy (maxent) models have been widely used for a variety of tasks, including language modeling [17], part-of-speech tagging, prepositional phrase attachment, and parsing [15], word selection for machine translation [2], and finding sentence boundaries [16]. They are also sometimes called logistic regression models, maximum likelihood exponential models, log-linear models,...

188 | Adaptive Statistical Language Modeling: A Maximum Entropy Approach
- Rosenfeld
- 1994
Citation Context ... Microsoft Way Redmond, WA 98052 http://www.research.microsoft.com 1 Introduction Conditional Maximum Entropy (maxent) models have been widely used for a variety of tasks, including language modeling [17], part-of-speech tagging, prepositional phrase attachment, and parsing [15], word selection for machine translation [2], and finding sentence boundaries [16]. They are also sometimes called logistic r...

183 | On structuring probabilistic dependences in stochastic language modeling - Ney, Essen, et al. - 1994

178 | A Maximum Entropy Approach to Identifying Sentence Boundaries
- Reynar, Ratnaparkhi
- 1997
Citation Context ...iety of tasks, including language modeling [17], part-of-speech tagging, prepositional phrase attachment, and parsing [15], word selection for machine translation [2], and finding sentence boundaries [16]. They are also sometimes called logistic regression models, maximum likelihood exponential models, log-linear models, and are even equivalent to a form of perceptrons/single layer neural networks. In...

88 | A survey of smoothing techniques for ME models
- Chen, Rosenfeld
- 2000
Citation Context ...f severe overfitting. There have been a number of approaches to this problem, which we will discuss in more detail in Section 3. The most relevant approach, however, is the work of Chen and Rosenfeld [3], who implemented a Gaussian prior for maxent models. They compared this technique to most of the previously implemented techniques, on a language modeling task, and concluded that it was consistently...

88 | Bayesian regularization and pruning using a Laplace prior, Neural Computation 7
- Williams
- 1995
Citation Context ...ms; because the Laplacian does not have a continuous first derivative, and because the Exponential prior is bounded at 0, standard gradient descent type algorithms may exhibit poor behavior. Williams [18] devotes a full ten pages to describing a somewhat heuristic approach for solving this problem, and even this discussion concludes "In summary it is left to the reader to supply the algorithms for det...

32 | Classes for fast maximum entropy training
- Goodman
- 2001
Citation Context ... ran experiments with language modeling, with mixed success. We used 1,000,000 words of training data (a small model, but one where smoothing matters) and a trigram model with a cluster-based speedup [9]. We evaluated on test data using the standard language modeling measure, perplexity, where lower scores are better. We tried six experiments: using Katz smoothing (a widely used version of Good-Turin...

27 | Online feature selection using grafting - Perkins, Theiler - 2003

16 | Adaptive statistical language modelling
- Lau
- 1994
Citation Context ...hod, making this update equation much more complex and time consuming than the exponential prior. Good-Turing discounting has been used or suggested for language modeling several times [17, page 38], [13]. In particular, it has been suggested to use an update of the form λ_k := λ_k + (1/f#) log(observed[k]′/expected[k]), where observed[k]′ is the Good-Turing discounted value of observed[k]. This update...
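The Good-Turing-style variant quoted in this context differs from plain GIS only in substituting a discounted count observed[k]′ for the raw count. In the sketch below the discounted counts are simply taken as given inputs (computing Good-Turing discounts is a separate step not shown in this context), and f# is again assumed to be the GIS step-size constant.

```python
import math

# Sketch of the discounted update quoted above:
#   lambda_k := lambda_k + (1/f#) * log(observed'[k] / expected[k])
# observed_disc holds the Good-Turing discounted counts, supplied externally.

def discounted_gis_step(lambdas, observed_disc, expected, f_sharp):
    return [lam + (1.0 / f_sharp) * math.log(od / exp)
            for lam, od, exp in zip(lambdas, observed_disc, expected)]

# With the discounted count below the expected count, the weight shrinks,
# which is the regularizing effect the discount is meant to provide.
new = discounted_gis_step([1.0], observed_disc=[1.5], expected=[3.0],
                          f_sharp=1.0)
```

Because observed[k]′ < observed[k], this update converges to smaller weights than unregularized GIS would.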

11 | Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995
Citation Context ...not previously known. Chen and Goodman [4] performed an extensive comparison of different smoothing (regularization) techniques for language modeling. They found that a version of Kneser-Ney smoothing [12] consistently was the best performing technique. Unfortunately, while there are partial theoretical justifications for Kneser-Ney smoothing, in terms of preserving marginals, one important part has pr...

9 | Mitigating the paucity of data problem
- Banko, Brill
- 2001
Citation Context ...gram models, again optimized on held out data. We were inspired to use an exponential prior by an actual examination of a data set. In particular, we used the grammar-checking data of Banko and Brill [1]. We chose this set because there are commonly used versions both with small amounts of data (which is when we expect the prior to matter) and with large amounts of data (which is required to easily s...

7 | Extension to the maximum entropy method
- Newman
- 1977
Citation Context ...ting, they do contain valuable information. Because of this, more recent approaches [17, page 38], [13] have tried techniques such as Good-Turing discounts [8]. There are a number of other approaches [11, 14] based on the fuzzy maxent framework [6]. Chen and Rosenfeld [3] give a more complete discussion of those approaches. Chen and Rosenfeld [3], following a suggestion of Lafferty, implemented a Gaussian...

5 | A method of maximum entropy estimation with relaxed constraints
- Khudanpur
- 1995
Citation Context ...ting, they do contain valuable information. Because of this, more recent approaches [17, page 38], [13] have tried techniques such as Good-Turing discounts [8]. There are a number of other approaches [11, 14] based on the fuzzy maxent framework [6]. Chen and Rosenfeld [3] give a more complete discussion of those approaches. Chen and Rosenfeld [3], following a suggestion of Lafferty, implemented a Gaussian...

3 | CFW: A collaborative filtering system using posteriors over weights of evidence
- Kadie, Meek, et al.
- 2002
Citation Context ...ative-filtering style task, television show recommendation, based on Nielsen data. The dataset used, and the definition of a collaborative filtering (CF) score is the same as was used by Kadie et al. [10], although our random train/test split is not the same, so the results are not strictly comparable. We first ran experiments with different priors on a heldout section of the training data, and then us...

1 | Statistical modeling by maximum entropy
- Pietra, Pietra
- 1993
Citation Context ...ecause of this, more recent approaches [17, page 38], [13] have tried techniques such as Good-Turing discounts [8]. There are a number of other approaches [11, 14] based on the fuzzy maxent framework [6]. Chen and Rosenfeld [3] give a more complete discussion of those approaches. Chen and Rosenfeld [3], following a suggestion of Lafferty, implemented a Gaussian prior for maxent models. They compared t...

1 | Supervised and semi-supervised sparse Bayesian classification - Figueiredo, Krishnapuram, et al. - 2003
