Results 1–10 of 28
Logistic Regression, AdaBoost and Bregman Distances
2000
Cited by 203 (43 self)

Abstract
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
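The sequential-update scheme mentioned in the abstract (one parameter adjusted per round) is what classical AdaBoost does. A minimal toy sketch with 1-D threshold stumps, purely illustrative and not the paper's algorithms (all names and data are made up):

```python
import math

def adaboost(X, y, n_rounds=10):
    """Sequential-update boosting (AdaBoost) with 1-D threshold stumps.
    X: list of floats, y: list of +1/-1 labels."""
    n = len(X)
    w = [1.0 / n] * n                       # example weights, start uniform
    ensemble = []                           # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        # pick the stump h(x) = sign * sgn(x - thr) with lowest weighted error
        best = None
        for thr in sorted(set(X)):
            for s in (+1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if s * (1 if xi > thr else -1) != yi)
                if best is None or err < best[0]:
                    best = (err, thr, s)
        err, thr, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # one coordinate per round
        ensemble.append((alpha, thr, s))
        # reweight: misclassified points gain weight, then renormalize
        w = [wi * math.exp(-alpha * yi * s * (1 if xi > thr else -1))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the stumps."""
    score = sum(a * s * (1 if x > thr else -1) for a, thr, s in ensemble)
    return 1 if score >= 0 else -1
```

A parallel-update variant, as contrasted in the abstract, would instead adjust all stump weights at once per round.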
Inequalities between Entropy and Index of Coincidence derived from Information Diagrams
IEEE Trans. Inform. Theory, 2001
Cited by 19 (11 self)

Abstract
To any discrete probability distribution P we can associate its entropy H(P) = −∑ pᵢ ln pᵢ and its index of coincidence IC(P) = ∑ pᵢ². The main result of the paper is the determination of the precise range of the map P ↦ (IC(P), H(P)). The range looks much like that of the map P ↦ (Pmax, H(P)), where Pmax is the maximal point probability, cf. research from 1965 (Kovalevskij [18]) to 1994 (Feder and Merhav [7]). The earlier results, which actually focus on the probability of error 1 − Pmax rather than Pmax, can be conceived as limiting cases of results obtained by the methods presented here. Ranges of maps as those indicated are called Information Diagrams. The main result gives rise to precise lower as well as upper bounds for the entropy function. Some of these bounds are essential for the exact solution of certain problems of universal coding and prediction for Bernoulli sources. Other applications concern Shannon theory (relations between various measures of divergence), statistical decision theory and rate distortion theory. Two methods are developed. One is topological; the other involves convex analysis and is based on a “lemma of replacement” which is of independent interest in relation to problems of optimization of mixed type (concave/convex optimization).
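The two quantities are easy to compute directly. As a quick sanity check, one elementary lower bound on entropy in terms of the index of coincidence is H(P) ≥ −ln IC(P), which follows from the monotonicity of Rényi entropies; it is much weaker than the sharp bounds the paper derives, and is used here only for illustration:

```python
import math

def entropy(p):
    """Shannon entropy H(P) = -sum p_i ln p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def index_of_coincidence(p):
    """IC(P) = sum p_i^2, the collision probability of two independent draws."""
    return sum(pi * pi for pi in p)
```

For a uniform distribution on k points, H = ln k and −ln IC = ln k, so the elementary bound is tight there.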
Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation
"... Accurate modeling of geographic distributions of species is crucial to various applications in ecology and conservation. The best performing techniques often require some parameter tuning, which may be prohibitively timeconsuming to do separately for each species, or unreliable for small or biased ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
Accurate modeling of geographic distributions of species is crucial to various applications in ecology and conservation. The best performing techniques often require some parameter tuning, which may be prohibitively time-consuming to do separately for each species, or unreliable for small or biased datasets. Additionally, even with the abundance of good quality data, users interested in the application of species models need not have the statistical knowledge required for detailed tuning. In such cases, it is desirable to use “default settings”, tuned and validated on diverse datasets. Maxent is a recently introduced modeling technique, achieving high predictive accuracy and enjoying several additional attractive properties. The performance of Maxent is influenced by a moderate number of parameters. The first contribution of this paper is the empirical tuning of these parameters. Since many datasets lack information about species absence, we present a tuning method that uses presence-only data. We evaluate our method on independently collected high-quality presence-absence data. In addition to tuning, we introduce several concepts that improve the predictive accuracy and running time of Maxent. We introduce “hinge features” that model more complex relationships in the training data; we describe a new logistic output format that gives an estimate of probability of presence; finally we explore “background sampling” strategies that cope with sample selection bias and decrease model-building time. Our evaluation, based on a diverse dataset of 226 species from 6 regions, shows: 1) default settings tuned on presence-only data achieve performance which is almost as good as if they had been tuned on the evaluation data itself; 2) hinge features substantially improve model …
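A hinge feature in this sense is a one-sided ramp max(0, x − t); linear combinations of hinge features over several knots give piecewise-linear response curves. A minimal sketch of such a basis expansion (illustrative only, not Maxent's actual feature machinery; knots and coefficients are made up):

```python
def hinge_features(x, knots):
    """Expand a scalar covariate into hinge basis functions max(0, x - t),
    one per knot t."""
    return [max(0.0, x - t) for t in knots]

def piecewise_response(x, knots, coefs):
    """A piecewise-linear response curve built as a weighted sum of hinges."""
    return sum(c * h for c, h in zip(coefs, hinge_features(x, knots)))
```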
On Bayesian bounds
In Proceedings of the 23rd International Conference on Machine Learning, 2006
Cited by 16 (2 self)

Abstract
We show that several important Bayesian bounds studied in machine learning, both in the batch as well as the online setting, arise by an application of a simple compression lemma. In particular, we derive (i) PAC-Bayesian bounds in the batch setting, (ii) Bayesian log-loss bounds and (iii) Bayesian bounded-loss bounds in the online setting using the compression lemma. Although every setting has different semantics for prior, posterior and loss, we show that the core bound argument is the same. The paper simplifies our understanding of several important and apparently disparate results, as well as brings to light a powerful tool for developing similar arguments for other methods.
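The compression lemma in question is essentially the Donsker–Varadhan variational bound: for any function φ and distributions Q (posterior) and P (prior), E_Q[φ] ≤ KL(Q‖P) + ln E_P[e^φ], with equality when Q is the Gibbs posterior Q ∝ P·e^φ. A numerical sanity check on finite distributions (a sketch, not the paper's derivation):

```python
import math

def kl(q, p):
    """KL divergence KL(Q || P) for finite distributions (nats)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def compression_gap(q, p, phi):
    """Slack of the compression lemma:
    KL(Q||P) + ln E_P[e^phi] - E_Q[phi], nonnegative for any q, p, phi."""
    lhs = sum(qi * fi for qi, fi in zip(q, phi))
    rhs = kl(q, p) + math.log(sum(pi * math.exp(fi) for pi, fi in zip(p, phi)))
    return rhs - lhs
```

The gap vanishes exactly at the Gibbs posterior, which is why one bound argument covers the PAC-Bayesian, log-loss, and bounded-loss settings.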
Maximum entropy density estimation and modeling geographic distributions of species
2007
Cited by 5 (0 self)

Abstract
The maximum entropy (maxent) approach, formally equivalent to maximum likelihood, is a widely used density-estimation method. When input datasets are small, maxent is likely to overfit. Overfitting can be eliminated by various smoothing techniques, such as regularization and constraint relaxation, but theory explaining their properties is often missing or needs to be derived for each case separately. In this dissertation, we propose a unified treatment for a large and general class of smoothing techniques. We provide fully general guarantees on their statistical performance and propose optimization algorithms with complete convergence proofs. As special cases, we can easily derive performance guarantees for many known regularization types including L1 and squared-L2 regularization. Furthermore, our general approach enables us to derive entirely new regularization functions with superior statistical guarantees. The new regularization functions use information about the structure of the feature space, incorporate information about sample selection bias, and combine information across several related density-estimation tasks. We propose algorithms solving a large and general subclass of generalized maxent problems, including all …
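For concreteness, the kind of objective being analyzed is the negative log-likelihood of a Gibbs distribution plus a regularization penalty such as L1 or squared L2. A minimal sketch on a finite domain (the domain, features, and penalty weight are illustrative, not the dissertation's formulation):

```python
import math

def regularized_maxent_loss(lam, feats, emp_mean, beta=0.1, reg="l1"):
    """Negative log-likelihood of the Gibbs model p(x) proportional to
    exp(lam . f(x)) on a finite domain, plus an L1 or squared-L2 penalty.
    feats: list of feature vectors f(x), one per domain point.
    emp_mean: empirical feature means from the sample."""
    scores = [sum(l * fx for l, fx in zip(lam, f)) for f in feats]
    log_z = math.log(sum(math.exp(s) for s in scores))   # log partition function
    nll = log_z - sum(l * m for l, m in zip(lam, emp_mean))
    if reg == "l1":
        penalty = beta * sum(abs(l) for l in lam)
    else:  # squared L2
        penalty = beta * sum(l * l for l in lam)
    return nll + penalty
```

The L1 penalty corresponds to relaxing the moment constraints to boxes around the empirical means, which is one way such smoothing techniques acquire statistical guarantees.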
Entropy and Equilibrium via Games of Complexity
"... It is suggested that thermodynamical equilibrium equals game theoretical equilibrium. Aspects of this thesis are discussed. The philosophy is consistent with maximum entropy thinking of Jaynes, but goes one step deeper by deriving the maximum entropy principle from an underlying game theoretical pri ..."
Abstract

Cited by 5 (2 self)
 Add to MetaCart
It is suggested that thermodynamical equilibrium equals game theoretical equilibrium. Aspects of this thesis are discussed. The philosophy is consistent with the maximum entropy thinking of Jaynes, but goes one step deeper by deriving the maximum entropy principle from an underlying game theoretical principle. The games introduced are based on measures of complexity. Entropy is viewed as minimal complexity. It is demonstrated that Tsallis entropy (q-entropy) and Kaniadakis entropy (κ-entropy) can be obtained in this way, based on suitable complexity measures. A certain unifying effect is obtained by embedding these measures in a two-parameter family of entropy functions.
Information theory at the service of science
Bolyai Society Mathematical Studies, 2007
Cited by 4 (3 self)

Abstract
Information theory is becoming more and more important for many fields. This is true for engineering and technology-based areas but also for more theoretically oriented sciences such as probability and statistics. Aspects of this development are first discussed at the non-technical level, with emphasis on the role of information theoretical games. The overall rationale is explained and central types of examples are presented where the game theoretical approach is useful. The final section contains full proofs related to a subject of central importance for statistics: estimation or updating by a posterior distribution which aims at minimizing divergence measured relative to a given prior.
AN ENTROPY POWER INEQUALITY FOR THE BINOMIAL FAMILY
2003
Cited by 3 (0 self)

Abstract
Communicated by S.S. Dragomir. In this paper, we prove that the classical Entropy Power Inequality, as derived in the continuous case, can be extended to the discrete family of binomial random variables with parameter 1/2.
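The claim can be checked numerically. Taking entropy power as N = e^{2H} (dropping the 1/(2πe) factor, which cancels on both sides), and using that independent Bin(n, 1/2) and Bin(m, 1/2) variables sum to Bin(n+m, 1/2), the discrete EPI reads e^{2H(n+m)} ≥ e^{2H(n)} + e^{2H(m)}. A sketch:

```python
import math

def binomial_entropy(n):
    """Shannon entropy (nats) of Bin(n, 1/2)."""
    probs = [math.comb(n, k) * 2.0 ** -n for k in range(n + 1)]
    return -sum(p * math.log(p) for p in probs)

def entropy_power(n):
    """e^{2H} for Bin(n, 1/2); the 1/(2*pi*e) factor cancels in the EPI."""
    return math.exp(2 * binomial_entropy(n))
```

Note the n = m = 1 case gives equality (8 ≥ 4 + 4), since H(Bin(2, 1/2)) = (3/2) ln 2.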