## Maximum entropy density estimation and modeling geographic distributions of species (2007)

Citations: 5 (0 self)

### BibTeX

```bibtex
@TECHREPORT{Dudík07maximumentropy,
  author      = {Miroslav Dudík},
  title       = {Maximum entropy density estimation and modeling geographic distributions of species},
  institution = {},
  year        = {2007}
}
```

### Abstract

The maximum entropy (maxent) approach, formally equivalent to maximum likelihood, is a widely used density-estimation method. When input datasets are small, maxent is likely to overfit. Overfitting can be eliminated by various smoothing techniques, such as regularization and constraint relaxation, but the theory explaining their properties is often missing or needs to be derived for each case separately. In this dissertation, we propose a unified treatment of a large and general class of smoothing techniques. We provide fully general guarantees on their statistical performance and propose optimization algorithms with complete convergence proofs. As special cases, we can easily derive performance guarantees for many known regularization types, including L1 and L2-squared regularization. Furthermore, our general approach enables us to derive entirely new regularization functions with superior statistical guarantees. The new regularization functions use information about the structure of the feature space, incorporate information about sample selection bias, and combine information across several related density-estimation tasks. We propose algorithms solving a large and general subclass of generalized maxent problems, including all...
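As a concrete illustration of the regularized maxent idea in the abstract (a minimal sketch on hypothetical toy data, not the dissertation's own algorithm or datasets), the code below fits a Gibbs distribution p(x) ∝ q0(x) exp(λ·f(x)) on a small discrete domain by gradient descent on an L2-squared-regularized dual:

```python
import numpy as np

# Minimal sketch of smoothed maxent density estimation on a 10-cell discrete domain.
# All features, constants, and the "true" distribution are illustrative assumptions.
rng = np.random.default_rng(0)
x = np.arange(10)
F = np.stack([x / 9.0, (x / 9.0) ** 2])        # two features in [0, 1], shape (2, 10)
q0 = np.full(10, 0.1)                          # uniform default distribution

true_p = np.exp(-((x - 6.0) ** 2) / 4.0)
true_p /= true_p.sum()
samples = rng.choice(10, size=30, p=true_p)    # small sample, so smoothing matters
f_emp = F[:, samples].mean(axis=1)             # empirical feature expectations

# Gradient descent on the regularized dual:
#   log Z(lambda) - lambda . f_emp + (beta/2) ||lambda||^2
beta, lr, lam = 0.1, 1.0, np.zeros(2)
for _ in range(5000):
    w = q0 * np.exp(lam @ F)
    p = w / w.sum()                            # current Gibbs distribution
    lam -= lr * (F @ p - f_emp + beta * lam)   # gradient of the regularized dual

w = q0 * np.exp(lam @ F)
p = w / w.sum()
# At the optimum, model feature expectations match relaxed empirical averages:
assert np.linalg.norm(F @ p - f_emp + beta * lam) < 1e-6
```

With β = 0 this recovers plain maxent (exact moment matching); β > 0 relaxes the constraints, which is the smoothing the abstract refers to.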

### Citations

8972 | The Nature of Statistical Learning Theory - Vapnik - 1995 |

8563 | Elements of Information Theory - Cover, Thomas - 2006 |

Citation Context: ...lative entropy (or maximum entropy) density p̂ = argmin_{p∈P} D(p ∥ q0). One might ask, "Why maximize the likelihood?" and there are several justifications, including optimal gambling and optimal coding (Cover and Thomas, 1991). Are these justifications less arbitrary than the justifications of maxent by information theory and axiomatic derivations? We believe that there is a difference. Instead of imposing desirability co... |

6038 | A Mathematical Theory of Communication - Shannon - 1948 |

3668 | Convex Optimization - Boyd, Vandenberghe - 2004 |

Citation Context: ...valent constraint ‖E_π̃[f] − E_p[f]‖₂² ≤ β², and obtain an equivalent primal min_{p∈∆} D(p ∥ q0) subject to ‖E_π̃[f] − E_p[f]‖₂² ≤ β². (4.5) If β > 0 then, by Lagrange duality and Slater's conditions (Boyd and Vandenberghe, 2004, Chapter 5), the value of Eq. (4.5) is the same as the value of max_{µ≥0} min_{p∈∆} [D(p ∥ q0) + µ(‖E_π̃[f] − E_p[f]‖₂² − β²)]. (4.6) The max-min value of Eq. (4.6) is attained at the saddle-point of th... |
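For readability, the relaxed primal (4.5) and its Lagrangian dual (4.6) quoted in this excerpt are, in standard notation:

```latex
\min_{p \in \Delta} \; D(p \,\|\, q_0)
  \quad \text{subject to} \quad
  \bigl\|\mathbb{E}_{\tilde\pi}[f] - \mathbb{E}_p[f]\bigr\|_2^2 \le \beta^2
  \tag{4.5}

\max_{\mu \ge 0} \, \min_{p \in \Delta}
  \Bigl[\, D(p \,\|\, q_0)
       + \mu \bigl( \bigl\|\mathbb{E}_{\tilde\pi}[f] - \mathbb{E}_p[f]\bigr\|_2^2 - \beta^2 \bigr) \Bigr]
  \tag{4.6}
```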

3265 | Variational Analysis - Rockafellar, Wets - 1997 |

Citation Context: ...which simply states that the graph of a convex function lies above its tangent (see Fig. 2.4). It turns out that the conjugate ψ* is a closed proper convex function and ψ** = ψ (for a proof see Rockafellar, 1970, Corollary 12.2.1). In this work we use several examples of closed proper convex functions. The first of them is the relative entropy, viewed as a function of its first argument and... |

2028 | Online Learning with Kernels - Kivinen, Smola, et al. - 2004 |

Citation Context: ...Hilbert-space equivalents. In machine learning, the most prominent examples of Hilbert spaces are reproducing kernel Hilbert spaces, used heavily in the support vector machine literature (see for example Schölkopf and Smola, 2002). A separate line of generalizations arises by replacing the ℓ2-ball constraints by ellipsoid constraints. These are represented using a positive definite matrix A, defining the potential and regular... |

1490 | Probability inequalities for sums of bounded random variables - Hoeffding - 1963 |

Citation Context: ...cal error inequalities used throughout this dissertation. All of the results are adapted from Devroye et al. (1996). Theorem A.1 (Hoeffding's inequality, Theorem 8.1 of Devroye et al., 1996; first in Hoeffding, 1963). Let X1, ..., Xm be independent random variables such that Xi ∈ [0, 1] with probability one. Denote their average by X̃m = (Σ_{i=1}^m Xi)/m. Then, for any ε > 0, P(X̃m − E[X̃m] ≥ ε) ≤ e^{−2ε²m} and P(... |
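The tail bound in this excerpt is easy to sanity-check numerically. The sketch below uses a hypothetical Bernoulli(1/2) setup (the choices of m, ε, and trial count are ours, not the dissertation's) and verifies that the simulated tail frequency stays below exp(−2ε²m):

```python
import numpy as np

# Monte Carlo check of Hoeffding's inequality as quoted above:
#   P(X~_m - E[X~_m] >= eps) <= exp(-2 eps^2 m)  for independent X_i in [0, 1].
rng = np.random.default_rng(1)
m, eps, trials = 100, 0.1, 20_000
means = (rng.random((trials, m)) < 0.5).mean(axis=1)  # averages of Bernoulli(1/2)
empirical_tail = (means - 0.5 >= eps).mean()
hoeffding_bound = np.exp(-2 * eps**2 * m)             # = exp(-2) here

assert empirical_tail <= hoeffding_bound              # the bound holds (loosely)
```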

1272 | Spline Models for Observational Data - Wahba - 1990 |

Citation Context: ...al is defined. Total-variation regularization has been successfully applied, for example, in image restoration (Strong and Chan, 2003). The approach is similar to smoothing splines (see, for example, Wahba, 1990; or Hastie et al., 2001, Chapter 5), which minimize L_π̃(g) + β ∫_{t_min}^{t_max} (d²g(t)/dt²)² dt. Thus, responses obtained by the ℓ1-regularized maxent with threshold features can be viewed as the ℓ1... |

1268 | Sample Selection Bias as a Specification Error - Heckman - 1979 |

Citation Context: ...Related Work. A traditional field where sample selection bias arises is econometrics. In econometrics, the data from surveys is affected by factors such as attrition, nonresponse, and self-selection (Heckman, 1979; Groves, 1989; Little and Rubin, 2002). An approach to coping with sample selection bias has been suggested by Heckman (1979) in linear regression. Here the bias is first estimated and then a transfo... |

1152 | Information Theory and Statistics - Kullback - 1968 |

1122 | Statistical Analysis with Missing Data - Little, Rubin - 1987 |

1083 | A maximum entropy approach to natural language processing - Berger, Pietra, et al. - 1996 |

Citation Context: ...by Jaynes (1957), and has since been used in many areas outside statistical mechanics (Kapur and Kesavan, 1992). In computer science, it has been particularly popular in natural language processing (Berger et al., 1996; Della Pietra et al., 1997). In maxent, one is given a set of known constraints on the target distribution. The target distribution is then estimated by a distribution of maximum entropy satisfying t... |

994 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, et al. - 1996 |

Citation Context: ...terms of the VC dimension by Sauer's lemma (Vapnik and Chervonenkis, 1971; Sauer, 1972): s(F, m) ≤ Σ_{i=0}^{d(F)} (m choose i). If m > 2d(F) then the right-hand side of Sauer's lemma can be further bounded (see Devroye et al., 1996, Theorem 13.3), yielding the simpler inequality ln s(F, m) ≤ d(F) ln(em/d(F)). (3.29) A central result of VC theory is the uniform convergence of empirical averages of feature classes with finite VC di... |
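Sauer's lemma as quoted can be checked on a concrete class. The sketch below uses threshold functions on the line, an illustrative choice of ours (not from the dissertation), whose VC dimension is 1 and whose growth function is exactly m + 1:

```python
from math import comb

# Growth function of threshold classifiers f_t(x) = 1[x >= t] on m distinct points:
# sliding the threshold through the m "gaps" (plus the two extremes) yields exactly
# m + 1 distinct labelings; the VC dimension of this class is d = 1.
def growth_thresholds(m):
    pts = range(m)
    thresholds = [p - 0.5 for p in pts] + [m]
    labelings = {tuple(int(x >= t) for x in pts) for t in thresholds}
    return len(labelings)

d = 1
for m in range(1, 12):
    sauer_bound = sum(comb(m, i) for i in range(d + 1))  # sum_{i<=d} C(m, i) = m + 1
    assert growth_thresholds(m) <= sauer_bound           # holds with equality here
```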

945 | On the uniform convergence of relative frequencies of events to their probabilities - Vapnik, Chervonenkis - 1971 |

Citation Context: ...of F is the largest number of samples for which all possible labelings exist: d(F) = max{m : s(F, m) = 2^m}. The growth function can be bounded in terms of the VC dimension by Sauer's lemma (Vapnik and Chervonenkis, 1971; Sauer, 1972): s(F, m) ≤ Σ_{i=0}^{d(F)} (m choose i). If m > 2d(F) then the right-hand side of Sauer's lemma can be further bounded (see Devroye et al., 1996, Theorem 13.3), yielding the simpler inequality ln s(... |

865 | The Elements of Statistical Learning - Hastie, Tibshirani, et al. - 2009 |

Citation Context: ...otal-variation regularization has been successfully applied, for example, in image restoration (Strong and Chan, 2003). The approach is similar to smoothing splines (see, for example, Wahba, 1990; or Hastie et al., 2001, Chapter 5), which minimize L_π̃(g) + β ∫_{t_min}^{t_max} (d²g(t)/dt²)² dt. Thus, responses obtained by the ℓ1-regularized maxent with threshold features can be viewed as the ℓ1 versions of solutions o... |

668 | Information theory and statistical mechanics - Jaynes - 1957 |

553 | Inducing features of random fields - Pietra, Pietra, et al. - 1997 |

Citation Context: ...since been used in many areas outside statistical mechanics (Kapur and Kesavan, 1992). In computer science, it has been particularly popular in natural language processing (Berger et al., 1996; Della Pietra et al., 1997). In maxent, one is given a set of known constraints on the target distribution. The target distribution is then estimated by a distribution of maximum entropy satisfying the given constraints. The c... |

539 | The meaning and use of the area under a receiver operating characteristic (ROC) curve - Hanley, McNeil - 1982 |

Citation Context: ...equal to the negative of its relative entropy from the default: −D(π ∥ q0). Another performance measure, applicable to any species-distribution modeling method, is the area under the ROC curve (AUC) (Hanley and McNeil, 1982), which uses a binary-labeled test set to measure the quality of a ranking of map cells. Specifically, the AUC is the probability that a randomly chosen test positive will be ranked above a randomly... |
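The rank-probability reading of the AUC in this excerpt translates directly into code. The scores and labels below are a hypothetical toy example, not data from the dissertation:

```python
import numpy as np

# AUC as the probability that a randomly chosen positive is ranked above a randomly
# chosen negative, counting ties as 1/2 (the Hanley-McNeil interpretation).
def auc(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])  # predicted suitability of 5 map cells
labels = np.array([1, 1, 0, 1, 0])            # binary-labeled test set
assert auc(scores, labels) == 5 / 6           # 5 of 6 positive-negative pairs ordered correctly
```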

530 | Theory of Point Estimation - Lehmann - 1983 |

Citation Context: ...etation of maxent as maximum likelihood has been suggested as an alternative justification of maxent (Jaynes, 1978). However, the maximum-likelihood setting in classical statistics (see, for example, Lehmann and Casella, 1998, Chapter 6) differs from the maxent setting in several aspects. First, in maximum likelihood, the true distribution is assumed to be from the same family as the distributions over which the likelihoo... |

504 | Model selection and estimation in regression with grouped variables - Yuan, Lin - 2006 |

Citation Context: ...uld set βg ∝ D2(f_g)/√m. When each group consists of exactly one feature, we obtain ℓ1 regularization. In the general case, we obtain the regularization known from linear models as the group lasso (Yuan and Lin, 2006). According to our guarantees, we benefit from partitioning the variables into groups as long as Σ_{g=1}^{G} ‖λ*_g‖₂ D2(f_g) √(ln G) ≤ ‖λ*‖₂ D2(f). (3.15) The leading √(ln G) on the left-hand side comes from... |
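The reduction mentioned in this excerpt (singleton groups recover ℓ1 regularization) is easy to see in code; the weight vector and grouping below are hypothetical illustrations:

```python
import numpy as np

# Group-lasso penalty sum_g ||lambda_g||_2 on a hypothetical weight vector;
# with singleton groups it coincides with the L1 penalty.
lam = np.array([0.5, -1.0, 2.0, 0.0])
groups = [[0, 1], [2, 3]]                                   # hypothetical partition
group_penalty = sum(np.linalg.norm(lam[g]) for g in groups)
singleton_penalty = sum(np.linalg.norm(lam[[i]]) for i in range(len(lam)))

assert np.isclose(singleton_penalty, np.abs(lam).sum())     # singleton groups = L1
assert group_penalty <= singleton_penalty                   # grouping never increases it
```

The second assertion is the triangle inequality per group: ‖λ_g‖₂ ≤ Σ_{i∈g} |λ_i|.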

495 | Ridge regression: Biased estimation for nonorthogonal problems - Hoerl, Kennard - 1970 |

Citation Context: ...ing solutions to ill-posed problems. In statistics, regularization was first introduced implicitly as shrinkage (Stein, 1956; James and Stein, 1961), and later explicitly as part of ridge regression (Hoerl and Kennard, 1970). The main idea is to include in the objective a penalty for the ruggedness of the solution. The goal is to remove some of the noise present in finite sampling and to make the optimum unique. The two... |

431 | Generalized iterative scaling for log-linear models - Darroch, Ratcliff - 1972 |

Citation Context: ...s, which we view as a surrogate for the difference between the primal and dual objective rather than a bound on the change in the dual objective. Standard maxent algorithms such as iterative scaling (Darroch and Ratcliff, 1972; Della Pietra et al., 1997), gradient descent, Newton and quasi-Newton methods (Cesa-Bianchi et al., 1994; Malouf, 2002; Salakhutdinov et al., 2003), and their regularized versions (Lau, 1994; William... |

390 | On the method of bounded differences - McDiarmid - 1989 |

Citation Context: ...average X̃m = (Σ_{i=1}^m Xi)/m, and the average variance by σ² = (Σ_{i=1}^m V[Xi])/m. Then, for any ε > 0, P(X̃m > ε) ≤ exp(−mε²/(2σ² + 2ε/3)). Theorem A.3 (McDiarmid's inequality, Theorem 9.2 of Devroye et al., 1996; first in McDiarmid, 1989). Let X1, ..., Xm be independent random variables taking values in a set A and assume that s : A^m → R satisfies sup_{x1,...,xm,x′i ∈ A} |s(x1, ..., xm) − s(x1, ..., xi−1, x′i, xi+1, ..., xm)| ≤ ci, 1... |

365 | Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization - Donoho, Elad - 2003 |

280 | Data Analysis Using Regression and Multilevel/Hierarchical Models - Gelman, Hill - 2007 |

Citation Context: ...s are influenced by the estimates for which we have more confidence; estimates from large sample sizes are less influenced by others. In statistics, this is known as hierarchical/multilevel modeling (Gelman and Hill, 2007) or shrinkage, introduced by Stein (1956) and James and Stein (1961). In machine learning, hierarchical models have been used, for example, by McCallum et al. (1998) and Teh et al. (2005). These m... |

269 | The foundations of cost-sensitive learning - Elkan - 2001 |

Citation Context: ...been recently considered for classification problems by Zadrozny (2004). Here the goal is to learn a decision rule from a biased sample. The problem is closely related to cost-sensitive learning (Elkan, 2001; Zadrozny et al., 2003) and the same techniques such as resampling or differential weighting of samples apply. However, the methods of the previous two approaches do not apply directly to density est... |

263 | Estimation with quadratic loss - James, Stein - 1961 |

Citation Context: ...Tikhonov (1963b,a), Ivanov (1962), and Phillips (1962) as a method of finding solutions to ill-posed problems. In statistics, regularization was first introduced implicitly as shrinkage (Stein, 1956; James and Stein, 1961), and later explicitly as part of ridge regression (Hoerl and Kennard, 1970). The main idea is to include in the objective a penalty for the ruggedness of the solution. The goal is to remove some of... |

259 | The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming - Bregman - 1967 |

Citation Context: ...sfy the following two properties: (B1) B(a ∥ b) ≥ 0; (B2) if B(at ∥ bt) → 0 and bt → b* then at → b*, where B stands for either D or D̃. These properties are motivated by the formalism of Bregman divergences (Bregman, 1967; Censor and Lent, 1981; Censor and Zenios, 1997), which generalize some common distance measures such as the squared Euclidean distance. Next example of a closed proper convex... |

258 | I-divergence geometry of probability distributions and minimization problems - Csiszár - 1975 |

Citation Context: ...constraints. The most common are equality constraints on feature expectations, introduced in the original papers of Jaynes and Kullback. Although other types of constraints appear in the literature (Csiszár, 1975; Jaynes, 1978; Shore and Johnson, 1980; Khudanpur, 1995), they have received little attention in practical applications until the 2000s. Yet, according to the max-min likelihood interpretation, the c... |

256 | Parallel Optimization: Theory, Algorithms, and Applications - Censor, Zenios - 1997 |

Citation Context: ...(B1) B(a ∥ b) ≥ 0; (B2) if B(at ∥ bt) → 0 and bt → b* then at → b*, where B stands for either D or D̃. These properties are motivated by the formalism of Bregman divergences (Bregman, 1967; Censor and Lent, 1981; Censor and Zenios, 1997), which generalize some common distance measures such as the squared Euclidean distance. Next example of a closed proper convex function is a convex indicator of a closed convex set C ⊆ R^n, denot... |

242 | A maximum entropy approach to adaptive statistical learning modeling - Rosenfeld - 1996 |

236 | Improving text classification by shrinkage in a hierarchy of classes - McCallum, Rosenfeld, et al. - 1998 |

236 | On the density of families of sets - Sauer - 1972 |

Citation Context: ...number of samples for which all possible labelings exist: d(F) = max{m : s(F, m) = 2^m}. The growth function can be bounded in terms of the VC dimension by Sauer's lemma (Vapnik and Chervonenkis, 1971; Sauer, 1972): s(F, m) ≤ Σ_{i=0}^{d(F)} (m choose i). If m > 2d(F) then the right-hand side of Sauer's lemma can be further bounded (see Devroye et al., 1996, Theorem 13.3), yielding the simpler inequality ln s(F, m) ≤ d(F) ln... |

232 | Multivariate adaptive regression splines (with discussion) - Friedman - 1991 |

Citation Context: ...es. For example, f2 could be replaced by f′2(x) = h(v1(x); θ1, v1; min) · h(v2(x); θ2, v2; max). In regression settings, path hinge features are used for example in multivariate adaptive regression splines (Friedman, 1991). Again, if smooth first or second derivatives are desired, it is possible to use products of higher-order splines. 2.3 Overfitting and Smoothing. As mentioned in Chapter 1, maxent can severely overfi... |

229 | A comparison of algorithms for maximum entropy parameter estimation - Malouf - 2002 |

Citation Context: ...l objective. Standard maxent algorithms such as iterative scaling (Darroch and Ratcliff, 1972; Della Pietra et al., 1997), gradient descent, Newton and quasi-Newton methods (Cesa-Bianchi et al., 1994; Malouf, 2002; Salakhutdinov et al., 2003), and their regularized versions (Lau, 1994; Williams, 1995; Chen and Rosenfeld, 2000; Kazama and Tsujii, 2003; Goodman, 2004; Krishnapuram et al., 2005) perform a sequenc... |

202 | Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy - Shore, Johnson - 1980 |

Citation Context: ...e entropy should be appropriate for the density estimation task. Even though a large body of research seems satisfied with the purely information-theoretic motivation (see, for example, references in Shore and Johnson, 1980), the apparent mismatch between the task at hand and the maximum-entropy principle prompted a large body of theoretical research, resulting in a variety of theoretical justifications. We mention thre... |

179 | Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems - Csiszár - 1991 |

Citation Context: ...he unknown sample-generating distribution. 1.1.4 Axiomatic Approaches. The problem of statistical inference is addressed more directly by axiomatic approaches (Shore and Johnson, 1980; Skilling, 1988; Csiszár, 1991). These approaches begin by formulating a set of properties desirable for consistent statistical inference, such as invariance under changes of coordinates and consistency under decompositions into... |

178 | Elementary Principles in Statistical Mechanics - Gibbs - 1902 |

176 | Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution - Stein - 1956 |

Citation Context: ...ntroduced by Tikhonov (1963b,a), Ivanov (1962), and Phillips (1962) as a method of finding solutions to ill-posed problems. In statistics, regularization was first introduced implicitly as shrinkage (Stein, 1956; James and Stein, 1961), and later explicitly as part of ridge regression (Hoerl and Kennard, 1970). The main idea is to include in the objective a penalty for the ruggedness of the solution. The goa... |

150 | Survey Errors and Survey Costs - Groves - 1989 |

Citation Context: ...A traditional field where sample selection bias arises is econometrics. In econometrics, the data from surveys is affected by factors such as attrition, nonresponse, and self-selection (Heckman, 1979; Groves, 1989; Little and Rubin, 2002). An approach to coping with sample selection bias has been suggested by Heckman (1979) in linear regression. Here the bias is first estimated and then a transform of the esti... |

146 | Representing twentieth-century space-time climate variability. Part I: Development of a 1961-90 mean monthly terrestrial climatology - New - 1999 |

Citation Context: ...e daily temperature and temperature range. The first three derive from a digital elevation model for North America (USGS, 2001), and the remaining four were interpolated from weather station readings (New et al., 1999). Each environmental variable is defined over a 386 × 286 grid, of which 58,065 points have data for all environmental variables. We used linear, quadratic, product, and threshold features. The remai... |

143 | A model of inductive bias learning - Baxter - 2000 |

Citation Context: ...machine learning, hierarchical models have been used, for example, by McCallum et al. (1998) and Teh et al. (2005). These methods are also related to multitask or transfer learning (Caruana, 1993; Baxter, 2000; Raina et al., 2006). In hierarchical maximum entropy, we assume that we are given a fixed class hierarchy. We fit the joint distribution of all classes, placing constraints on individual class distri... |

142 | Where do we stand on maximum entropy? - Jaynes - 1978 |

Citation Context: ...empirical averages is equivalent to maximum likelihood in an exponential family. The dual interpretation of maxent as maximum likelihood has been suggested as an alternative justification of maxent (Jaynes, 1978). However, the maximum-likelihood setting in classical statistics (see, for example, Lehmann and Casella, 1998, Chapter 6) differs from the maxent setting in several aspects. First, in maximum likeli... |

136 | Feature selection, L1 vs. L2 regularization, and rotational invariance - Ng - 2004 |

Citation Context: ...ent and logistic regression, which is a conditional version of maxent, with various types of regularization, such as ℓ1-style regularization (Khudanpur, 1995; Williams, 1995; Kazama and Tsujii, 2003; Ng, 2004; Goodman, 2004; Krishnapuram et al., 2005), ℓ2²-style regularization (Lau, 1994; Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001; Zhang, 2005), as well as some other types of regularization suc... |

133 | Mathematical Theory of Connecting Networks and Telephone Traffic - Beneš - 1965 |

123 | The alternating decision tree learning algorithm - Freund, Mason - 1999 |

115 | Relative loss bounds for on-line density estimation with the exponential family of distributions - Azoury, Warmuth - 2001 |

113 | Sparse multinomial logistic regression: fast algorithms and generalization bounds - Krishnapuram, Carin, et al. - 2005 |

Citation Context: ...hods (Cesa-Bianchi et al., 1994; Malouf, 2002; Salakhutdinov et al., 2003), and their regularized versions (Lau, 1994; Williams, 1995; Chen and Rosenfeld, 2000; Kazama and Tsujii, 2003; Goodman, 2004; Krishnapuram et al., 2005) perform a sequence of feature-weight updates until convergence. In each step, they update all feature weights. This is impractical when the number of features is very large. Instead, we propose a seq... |

106 | Cost-sensitive learning by cost-proportionate example weighting - Zadrozny, Langford, et al. - 2003 |

Citation Context: ...considered for classification problems by Zadrozny (2004). Here the goal is to learn a decision rule from a biased sample. The problem is closely related to cost-sensitive learning (Elkan, 2001; Zadrozny et al., 2003) and the same techniques such as resampling or differential weighting of samples apply. However, the methods of the previous two approaches do not apply directly to density estimation where the setup... |

101 | A technique for the numerical solution of certain integral equations of the first kind - Phillips - 1962 |