## Maximum entropy distribution estimation with generalized regularization (2006)

### Download Links

- [www.cs.princeton.edu]
- [www.cs.cmu.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Proc. Annual Conf. Computational Learning Theory

Citations: 26 (1 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Dudík06maximumentropy,
  author    = {Miroslav Dudík and Robert E. Schapire},
  title     = {Maximum entropy distribution estimation with generalized regularization},
  booktitle = {Proc. Annual Conf. Computational Learning Theory},
  year      = {2006},
  publisher = {Springer Verlag}
}
```

### Abstract

We present a unified and complete account of maximum entropy distribution estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we can easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1+ℓ2² style regularization. Furthermore, our general approach enables us to use information about the structure of the feature space or about sample selection bias to derive entirely new regularization functions with superior guarantees. We propose an algorithm solving a large and general subclass of generalized maxent problems, including all discussed in the paper, and prove its convergence. Our approach generalizes techniques based on information geometry and Bregman divergences as well as those based more directly on compactness.
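To make the ℓ1-regularized special case concrete, here is a minimal sketch (my own illustration, not the authors' algorithm; the toy domain, features, and empirical averages are hypothetical). On a finite domain the ℓ1-regularized maxent dual is min over λ of ln Z(λ) − λ·π̃[f] + β‖λ‖1, which proximal gradient descent with soft-thresholding solves directly:

```python
import numpy as np

def l1_maxent(F, emp, beta=0.1, lr=0.1, steps=5000):
    """Fit a Gibbs distribution q(x) proportional to exp(lam . f(x)) on a
    finite domain by proximal gradient descent on the l1-regularized dual:
        min_lam  log Z(lam) - lam . emp + beta * ||lam||_1
    F:   (|X|, n) matrix of feature values f_j(x)
    emp: length-n vector of empirical feature averages (pi~[f])."""
    lam = np.zeros(F.shape[1])
    for _ in range(steps):
        logits = F @ lam
        logits -= logits.max()          # numerical stability
        q = np.exp(logits)
        q /= q.sum()                    # current Gibbs distribution q_lam
        grad = F.T @ q - emp            # gradient of log Z(lam) - lam . emp
        lam -= lr * grad                # gradient step on the smooth part
        # proximal step for beta*||.||_1: soft-thresholding
        lam = np.sign(lam) * np.maximum(np.abs(lam) - lr * beta, 0.0)
    return lam

# Toy problem (hypothetical data): 4-point domain, 2 binary features.
F = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])
emp = np.array([0.6, 0.3])
lam = l1_maxent(F, emp, beta=0.1)
q = np.exp(F @ lam); q /= q.sum()
# l1 regularization relaxes the moment constraints to a box:
# |E_q[f_j] - emp_j| <= beta for every feature j.
print(np.abs(F.T @ q - emp))
```

At the optimum the moment constraints hold only up to the box |E_q[f_j] − π̃[f_j]| ≤ β, and weights of features whose constraints are satisfied strictly inside the box are driven exactly to zero; this sparsity is the ℓ1 effect the paper's guarantees cover.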

### Citations

3668 | Convex optimization
- Boyd, Vandenberghe
- 2004

Citation context: ...e functions. 3 Convex Analysis Background Throughout this paper we make use of convex analysis. The most relevant concepts are convex conjugacy and Fenchel’s duality which we introduce here (see also [20, 21]). Consider a function ψ : R^n → (−∞, ∞]. The effective domain of ψ is the set dom ψ = {u ∈ R^n | ψ(u) < ∞}. A point u where ψ(u) < ∞ is called feasible. The epigraph of ψ is the set of points above its...
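For readers skimming this context, the two notions it names can be stated explicitly. These are textbook definitions from convex analysis generally, not quoted from this page:

```latex
% Convex conjugate of \psi : \mathbb{R}^n \to (-\infty, \infty]:
\psi^*(\lambda) = \sup_{u \in \mathbb{R}^n} \bigl( \lambda \cdot u - \psi(u) \bigr)

% Fenchel's duality (for convex \psi, \varphi and a linear map F,
% under a standard constraint-qualification condition):
\min_{u} \bigl[ \psi(u) + \varphi(F u) \bigr]
  = \sup_{\lambda} \bigl[ -\psi^*(F^{\top}\lambda) - \varphi^*(-\lambda) \bigr]
```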

3265 | Variational Analysis
- Rockafellar, Wets
- 1997

Citation context: ...e functions. 3 Convex Analysis Background Throughout this paper we make use of convex analysis. The most relevant concepts are convex conjugacy and Fenchel’s duality which we introduce here (see also [20, 21]). Consider a function ψ : R^n → (−∞, ∞]. The effective domain of ψ is the set dom ψ = {u ∈ R^n | ψ(u) < ∞}. A point u where ψ(u) < ∞ is called feasible. The epigraph of ψ is the set of points above its...

1083 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996

Citation context: ...ach to probability distribution estimation was first proposed by Jaynes [1], and has since been used in many areas of computer science and statistical learning, especially natural language processing [2, 3], and more recently in species habitat modeling [4]. In maxent, one is given a set of samples from a target distribution over some space, and a set of known constraints on the distribution. The distri...

553 | Inducing features of random fields
- Pietra, Pietra, et al.
- 1997

Citation context: ...ach to probability distribution estimation was first proposed by Jaynes [1], and has since been used in many areas of computer science and statistical learning, especially natural language processing [2, 3], and more recently in species habitat modeling [4]. In maxent, one is given a set of samples from a target distribution over some space, and a set of known constraints on the distribution. The distri...

431 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972

Citation context: ...w as a surrogate for the difference between the primal and dual objective rather than a bound on the change in the dual objective. There are many standard maxent algorithms, such as iterative scaling [3, 15], gradient descent, Newton and quasi-Newton methods [16] and their regularized versions [5, 6, 9, 10, 17]. In this paper, we focus on an algorithm that performs sequential updates of feature weights s...

229 | A comparison of algorithms for maximum entropy parameter estimation
- Malouf
- 2002

Citation context: ...dual objective rather than a bound on the change in the dual objective. There are many standard maxent algorithms, such as iterative scaling [3, 15], gradient descent, Newton and quasi-Newton methods [16] and their regularized versions [5, 6, 9, 10, 17]. In this paper, we focus on an algorithm that performs sequential updates of feature weights similarly to boosting and sequential algorithms considere...

204 | Logistic regression, AdaBoost and Bregman distances
- Collins, Schapire, et al.
- 2004

Citation context: ...ng techniques that unify previous approaches and extend them to a more general setting. Specifically, our unified approach generalizes techniques based on information geometry and Bregman divergences [3, 14] as well as those based more directly on compactness [11]. The main novel ingredient is a modified definition of an auxiliary function, a customary measure of progress, which we view as a surrogate fo...

136 | Feature selection, L1 vs. L2 regularization, and rotational invariance
- Ng

Citation context: ...arization equals a norm raised to a power greater than one. With the exception of [8, 11, 12], previous work does not include guarantees applicable to our case, albeit Krishnapuram et al. [17] and Ng [18] give guarantees for ℓ1-regularized logistic regression. 2 Preliminaries The goal is to estimate an unknown target distribution π over a sample space X based on samples x1, ..., xm ∈ X. We assume ...

113 | Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Krishnapuram, Carin, et al.
- 2005

Citation context: ...n the change in the dual objective. There are many standard maxent algorithms, such as iterative scaling [3, 15], gradient descent, Newton and quasi-Newton methods [16] and their regularized versions [5, 6, 9, 10, 17]. In this paper, we focus on an algorithm that performs sequential updates of feature weights similarly to boosting and sequential algorithms considered in [11, 14]. Sequential updates are especially ...

86 | A survey of smoothing techniques for ME models
- Chen, Rosenfeld

Citation context: ...Gibbs distributions is too expressive and the algorithm overfits. Common approaches to counter overfitting are regularization [5–8], introduction of a prior [9], feature selection [2, 3], discounting [5, 6] and constraint relaxation [10, 11]. Thus, there are many ways to control overfitting in maxent calling for a general treatment. In this work, we study a generalized form of maxent. Although mentioned...

80 | Boosting and maximum likelihood for exponential models
- Lebanon, Lafferty
- 2001

Citation context: ...whereas, as shown in [11], ℓ1-regularized maxent corresponds to box constraints |π̃[fj] − p[fj]| ≤ β, which can be represented by U^(1)(u) = I_C(u) where C = π̃[f] + [−β, β]^n. Finally, as pointed out in [6, 7], ℓ2²-regularized maxent is obtained using the potential U^(2)(u) = ‖π̃[f] − u‖2²/(2α), which incurs an ℓ2²-style penalty for deviating from empirical averages. To simplify the exposition, we use th...
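The two potentials quoted above have closed-form conjugates, which is what turns them into dual penalties. A quick derivation (standard convex analysis, my own sketch following the quoted notation):

```latex
% Box potential: U^{(1)}(u) = I_C(u) with C = \tilde{\pi}[f] + [-\beta, \beta]^n.
% Its conjugate is the support function of C:
U^{(1)*}(\lambda) = \sup_{u \in C} \lambda \cdot u
                  = \lambda \cdot \tilde{\pi}[f] + \beta \|\lambda\|_1 .

% Quadratic potential: U^{(2)}(u) = \|\tilde{\pi}[f] - u\|_2^2 / (2\alpha).
% The supremum of \lambda \cdot u - U^{(2)}(u) is attained at
% u = \tilde{\pi}[f] + \alpha \lambda, giving
U^{(2)*}(\lambda) = \lambda \cdot \tilde{\pi}[f] + \tfrac{\alpha}{2} \|\lambda\|_2^2 .
```

Plugged into the dual objective −ln Z_λ − U*(−λ), these yield the familiar ℓ1 penalty β‖λ‖1 and ℓ2² penalty (α/2)‖λ‖2² on the regularized log-likelihood.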

57 | Exponential priors for maximum entropy models
- Goodman
- 2004

Citation context: ... From the dual perspective, the family of Gibbs distributions is too expressive and the algorithm overfits. Common approaches to counter overfitting are regularization [5–8], introduction of a prior [9], feature selection [2, 3], discounting [5, 6] and constraint relaxation [10, 11]. Thus, there are many ways to control overfitting in maxent calling for a general treatment. In this work, we study a ...

52 | Performance guarantees for regularized maximum entropy density estimation
- Dudík, Phillips, et al.
- 2004

Citation context: ...sive and the algorithm overfits. Common approaches to counter overfitting are regularization [5–8], introduction of a prior [9], feature selection [2, 3], discounting [5, 6] and constraint relaxation [10, 11]. Thus, there are many ways to control overfitting in maxent calling for a general treatment. In this work, we study a generalized form of maxent. Although mentioned by other authors as fuzzy maxent [...

40 | Information theory and statistical mechanics, Phys. Rev.
- Jaynes
- 1957

Citation context: ...y and Bregman divergences as well as those based more directly on compactness. 1 Introduction The maximum entropy (maxent) approach to probability distribution estimation was first proposed by Jaynes [1], and has since been used in many areas of computer science and statistical learning, especially natural language processing [2, 3], and more recently in species habitat modeling [4]. In maxent, one i...

37 | Unifying divergence minimization and statistical inference via convex duality
- Altun, Smola
- 2006

Citation context: ...r properties of the feature space. An alternative line of generalizations arises by replacing relative entropy in the primal objective by an arbitrary Bregman or Csiszár divergence along the lines of [12, 14]. Analogous duality results as well as a modified algorithm apply in the new setting, but performance guarantees do not directly translate to the case when divergences are derived from samples. Div...

16 | Adaptive Statistical Language Modelling
- Lau
- 1994

Citation context: ...Gibbs distributions is too expressive and the algorithm overfits. Common approaches to counter overfitting are regularization [5–8], introduction of a prior [9], feature selection [2, 3], discounting [5, 6] and constraint relaxation [10, 11]. Thus, there are many ways to control overfitting in maxent calling for a general treatment. In this work, we study a generalized form of maxent. Although mentioned...

9 | Extension to the maximum entropy method
- Newman
- 1977

Citation context: ...version of maxent, with ℓ1-style regularization [9–11, 17, 18], ℓ2²-style regularization [5–8] as well as some other types of regularization such as ℓ1+ℓ2²-style [10] and ℓ2-style regularization [19]. In a recent work, Altun and Smola [12] explore regularized formulations (with duality and performance guarantees) where the entropy is replaced by an arbitrary Bregman or Csiszár divergence and regu...

5 | Class-size independent generalization analysis of some discriminative multi-category classification
- Zhang
- 2004

Citation context: ...lity and performance guarantees) where the entropy is replaced by an arbitrary Bregman or Csiszár divergence and regularization equals a norm raised to a power greater than one. With the exception of [8, 11, 12], previous work does not include guarantees applicable to our case, albeit Krishnapuram et al. [17] and Ng [18] give guarantees for ℓ1-regularized logistic regression. 2 Preliminaries The goal is to e...

1 | A maximum entropy approach to species distribution modeling
- Phillips, Dudík, et al.
- 2004

Citation context: ...x) and use Fenchel’s duality:

min_{p∈∆} [D(p ‖ q0) + U(p[f])] = min_{p∈∆} [D(p ‖ q0) + U(Fp)]
  = sup_{λ∈R^n} [ −ln( ∑_{x∈X} q0(x) exp((F^⊤λ)_x) ) − U*(−λ) ]   (4)
  = sup_{λ∈R^n} [ −ln Z_λ − U*(−λ) ].   (5)

In Eq. (4), we apply Theorem 1. We use (F^⊤λ)_x to denote the entry of F^⊤λ indexed by x. In Eq. (5), we note that (F^⊤λ)_x = λ · f(x) and thus the expression inside the logarithm equals the normalization cons...
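The primal–dual equality ending in Eq. (5) can be sanity-checked numerically. The sketch below is my own illustration on a tiny random domain, using the ℓ2²-style potential U(u) = ‖π̃[f] − u‖²/(2α) rather than the authors' setup; all data are hypothetical. It minimizes the primal over the simplex and maximizes the dual, and the two optimal values coincide up to solver tolerance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
F = rng.normal(size=(2, 5))            # feature matrix: (Fp)_j = p[f_j]; 2 features, 5 points
q0 = np.full(5, 0.2)                   # default (uniform) distribution q0
emp = F @ q0 + np.array([0.2, -0.1])   # hypothetical empirical averages pi~[f]
alpha = 0.5                            # U(u) = ||emp - u||^2 / (2 * alpha)

def primal(z):
    """D(p || q0) + U(Fp), with p = softmax(z) ranging over the simplex interior."""
    p = np.exp(z - z.max())
    p = np.maximum(p, 1e-300)          # guard against underflow during line search
    p /= p.sum()
    return np.sum(p * np.log(p / q0)) + np.sum((emp - F @ p) ** 2) / (2 * alpha)

def neg_dual(lam):
    """Negative of the dual in Eq. (5): ln Z_lam + U*(-lam),
    where U*(-lam) = -lam.emp + (alpha/2)||lam||^2 for this quadratic U."""
    Z = np.sum(q0 * np.exp(F.T @ lam))  # Z_lam = sum_x q0(x) exp((F^T lam)_x)
    return np.log(Z) - lam @ emp + 0.5 * alpha * lam @ lam

p_opt = minimize(primal, np.zeros(5), method="BFGS").fun
d_opt = -minimize(neg_dual, np.zeros(2), method="BFGS").fun
print(abs(p_opt - d_opt))              # no duality gap: difference is near zero
```

Weak duality guarantees d_opt ≤ p_opt for any feasible pair; the Fenchel duality invoked in Eq. (5) is what closes the gap.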

1 | Evaluation and extension of maximum entropy models with inequality constraints
- Kazama, Tsujii

Citation context: ...sive and the algorithm overfits. Common approaches to counter overfitting are regularization [5–8], introduction of a prior [9], feature selection [2, 3], discounting [5, 6] and constraint relaxation [10, 11]. Thus, there are many ways to control overfitting in maxent calling for a general treatment. In this work, we study a generalized form of maxent. Although mentioned by other authors as fuzzy maxent [...

1 | Correcting sample selection bias in maximum entropy density estimation
- Dudík, Schapire, et al.
- 2006

Citation context: ...corresponding to constraints on variances or covariances of the base features. The second case is when the sample selection process is known to be biased. Both of these cases were studied previously [4, 13]. Here, we apply our general framework to derive improved generalization bounds using an entirely new form of regularization. These results improve on bounds for previous forms of regularization by up...