Results 1–10 of 31
Logistic Regression, AdaBoost and Bregman Distances
, 2000
Abstract

Cited by 261 (44 self)
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
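To make the sequential-update view concrete, here is a minimal sketch of AdaBoost as greedy coordinate descent on the exponential loss, with decision stumps as the coordinates. This is an illustrative NumPy implementation, not the paper's notation; `adaboost` and its exhaustive stump search are assumptions made for the example.

```python
import numpy as np

def adaboost(X, y, n_rounds=10):
    """Sequential-update boosting: each round greedily adds one weak
    learner (a decision stump), i.e. coordinate descent on the
    exponential loss sum_i exp(-y_i * F(x_i)), with y_i in {-1, +1}."""
    n, d = X.shape
    w = np.ones(n) / n                          # example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        best = None
        # Exhaustive search for the stump with lowest weighted error.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # closed-form step size
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)          # multiplicative reweighting
        w /= w.sum()
        stumps.append((j, thr, sign))
        alphas.append(alpha)

    def F(Xq):
        s = sum(a * sg * np.where(Xq[:, jj] > t, 1, -1)
                for a, (jj, t, sg) in zip(alphas, stumps))
        return np.sign(s)
    return F
```

The parallel-update variants discussed in the paper instead adjust all coordinates at once per iteration.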
Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications, manuscript, available at www-stat.wharton.upenn.edu/~buja
, 2005
Abstract

Cited by 50 (1 self)
What are the natural loss functions or fitting criteria for binary class probability estimation? This question has a simple answer: so-called “proper scoring rules”, that is, functions that score probability estimates in view of data in a Fisher-consistent manner. Proper scoring rules comprise most loss functions currently in use: log-loss, squared error loss, boosting loss, and, as limiting cases, cost-weighted misclassification losses. Proper scoring rules have a rich structure:
• Every proper scoring rule is a mixture (limit of sums) of cost-weighted misclassification losses. The mixture is specified by a weight function (or measure) that describes which misclassification cost weights are most emphasized by the proper scoring rule.
• Proper scoring rules permit Fisher scoring and Iteratively Reweighted LS algorithms for model fitting. The weights are derived from a link function and the above weight function.
• Proper scoring rules are in a one-to-one correspondence with information measures for tree-based classification.
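Propriety can be illustrated numerically: a scoring rule is proper when reporting the true class probability minimizes the expected score. The helper names below (`log_loss`, `brier`, `expected_loss`) are hypothetical, chosen for this sketch.

```python
import numpy as np

def log_loss(q, y):
    """Log-loss for outcome y in {0, 1} and reported probability q = P(y=1)."""
    return -(y * np.log(q) + (1 - y) * np.log(1 - q))

def brier(q, y):
    """Squared-error (Brier) loss, another proper scoring rule."""
    return (y - q) ** 2

def expected_loss(loss, p, q):
    """Expected loss when the true P(y=1) is p and we report q."""
    return p * loss(q, 1) + (1 - p) * loss(q, 0)
```

Scanning q over a grid for a fixed true probability p shows the expected loss of both rules bottoming out at q = p, which is exactly the Fisher-consistency property the abstract describes.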
Discriminative, Generative and Imitative Learning
, 2002
Abstract

Cited by 45 (1 self)
I propose a common framework that combines three different paradigms in machine learning: generative, discriminative and imitative learning. A generative probabilistic distribution is a principled way to model many machine learning and machine perception problems. Therein, one provides domain-specific knowledge in terms of structure and parameter priors over the joint space of variables. Bayesian networks and Bayesian statistics provide a rich and flexible language for specifying this knowledge and subsequently refining it with data and observations. The final result is a distribution that is a good generator of novel exemplars.
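As a toy instance of this generative view (a hedged sketch, not the thesis's actual framework): fitting class-conditional Gaussians yields both a Bayes-rule classifier and a generator of novel exemplars from the same model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model: one Gaussian per class plus a class prior.
X0 = rng.normal(loc=-2.0, scale=1.0, size=200)   # class-0 training data
X1 = rng.normal(loc=+2.0, scale=1.0, size=200)   # class-1 training data

params = {c: (X.mean(), X.std()) for c, X in ((0, X0), (1, X1))}
prior = {0: 0.5, 1: 0.5}

def log_gauss(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

def classify(x):
    # Discriminate via Bayes' rule on the fitted joint model.
    scores = {c: np.log(prior[c]) + log_gauss(x, *params[c]) for c in (0, 1)}
    return max(scores, key=scores.get)

def generate(c):
    # The same fitted model is "a good generator of novel exemplars".
    mu, sd = params[c]
    return rng.normal(mu, sd)
```

A discriminative model would instead learn only the decision boundary and could not implement `generate`.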
Additive Models, Boosting, and Inference for Generalized Divergences
 In Proc. 12th Annu. Conf. on Comput. Learning Theory
, 1999
Abstract

Cited by 45 (2 self)
We present a framework for designing incremental learning algorithms derived from generalized entropy functionals. Our approach is based on the use of Bregman divergences together with the associated class of additive models constructed using the Legendre transform. A particular one-parameter family of Bregman divergences is shown to yield a family of loss functions that includes the log-likelihood criterion of logistic regression as a special case, and that closely approximates the exponential loss criterion used in the AdaBoost algorithms of Schapire et al., as the natural parameter of the family varies. We also show how the quadratic approximation of the gain in Bregman divergence results in a weighted least-squares criterion. This leads to a family of incremental learning algorithms that builds upon and extends the recent interpretation of boosting in terms of additive models proposed by Friedman, Hastie, and Tibshirani.
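The Bregman machinery underlying this framework follows directly from its definition, D_F(p, q) = F(p) - F(q) - ⟨∇F(q), p - q⟩. A minimal sketch (illustrative helper names): with F(x) = ‖x‖² the divergence recovers squared Euclidean distance, and with the negative entropy F(x) = Σ xᵢ log xᵢ it recovers (generalized) KL divergence.

```python
import numpy as np

def bregman(F, grad_F, p, q):
    """Bregman divergence D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(F(p) - F(q) - grad_F(q) @ (p - q))

# Two classic generators of the family:
sq = lambda x: float(x @ x)                       # F(x) = ||x||^2
sq_grad = lambda x: 2 * x
negent = lambda x: float(np.sum(x * np.log(x)))   # F(x) = sum x log x
negent_grad = lambda x: np.log(x) + 1
```

For probability vectors (both summing to one), the negative-entropy case reduces exactly to the KL divergence, which is the connection to logistic regression's log-likelihood criterion mentioned above.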
Margin-based ranking meets boosting in the middle
 In: Learning Theory: COLT 2005, Springer-Verlag (2005) 63–78
, 2005
Abstract

Cited by 27 (12 self)
We present a margin-based bound for ranking in a general setting, using the L∞ covering number of the hypothesis space as our complexity measure. Our bound suggests that ranking algorithms that maximize the ranking margin will generalize well. We produce a Smooth Margin Ranking algorithm, which is a modification of RankBoost analogous to Approximate Coordinate Ascent Boosting. We prove that this algorithm makes progress with respect to the ranking margin at every iteration and converges to a maximum margin solution. In the special case of bipartite ranking, the objective function of RankBoost is related to an exponentiated version of the AUC. In the empirical studies of Cortes and Mohri, and Caruana and Niculescu-Mizil, it has been observed that AdaBoost tends to maximize the AUC. In this paper, we give natural conditions such that AdaBoost maximizes the exponentiated loss associated with the AUC, i.e., conditions under which AdaBoost and RankBoost will produce the same result, explaining the empirical observations.
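For the bipartite case, the relationship between the AUC and its exponentiated version can be sketched in a few lines. The helper names are hypothetical; the pairwise exponential sum is RankBoost's objective in spirit, not the paper's exact notation.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC as the fraction of correctly ordered (positive, negative) pairs."""
    return float(np.mean([sp > sn for sp in scores_pos for sn in scores_neg]))

def exp_rank_loss(scores_pos, scores_neg):
    """RankBoost-style pairwise objective: sum over pairs of
    exp(-(s_pos - s_neg)), an exponentiated surrogate that upper-bounds
    the count of misordered pairs (i.e. a surrogate for 1 - AUC)."""
    return float(sum(np.exp(-(sp - sn))
                     for sp in scores_pos for sn in scores_neg))
```

Driving the exponential pairwise loss down pushes every positive score above every negative score, which is what drives the AUC up.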
The Latent Maximum Entropy Principle
 In Proc. of ISIT
, 2002
Abstract

Cited by 19 (5 self)
We present an extension to Jaynes' maximum entropy principle that handles latent variables. The principle of latent maximum entropy we propose is different from both Jaynes' maximum entropy principle and maximum likelihood estimation, but often yields better estimates in the presence of hidden variables and limited training data. We first show that solving for a latent maximum entropy model poses a hard nonlinear constrained optimization problem in general. However, we then show that feasible solutions to this problem can be obtained efficiently for the special case of log-linear models, which forms the basis for an efficient approximation to the latent maximum entropy principle. We derive an algorithm that combines expectation-maximization with iterative scaling to produce feasible log-linear solutions. This algorithm can be interpreted as an alternating minimization algorithm in the information divergence, and reveals an intimate connection between the latent maximum entropy and maximum likelihood principles.
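For intuition on the log-linear special case, here is a minimal iterative-scaling sketch. It is fully observed (no latent variables), so it illustrates plain Jaynes-style maximum entropy rather than the latent principle itself; the setup (a four-point support, one moment constraint) is an assumption made for the example.

```python
import numpy as np

# Maximum-entropy distribution over X = {0, 1, 2, 3} subject to the
# single constraint E[f] = 2.0 with f(x) = x, fitted by a GIS-style
# iterative-scaling update on the log-linear form p(x) ∝ exp(lam * f(x)).
xs = np.arange(4)
f = xs.astype(float)
target = 2.0

lam = 0.0
for _ in range(500):
    p = np.exp(lam * f)
    p /= p.sum()
    model_mean = float((p * f).sum())
    # Damped ratio step: move lam until the model moment matches the target.
    lam += (1.0 / f.max()) * np.log(target / model_mean)

p = np.exp(lam * f)
p /= p.sum()
```

The latent extension described in the abstract wraps updates like this inside an expectation step over the hidden variables.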
Sketching information divergences
 In Conference on Learning Theory
, 2007
Abstract

Cited by 15 (4 self)
When comparing discrete probability distributions, natural measures of similarity are not ℓp distances but rather are information divergences such as Kullback-Leibler and Hellinger. This paper considers some of the issues related to constructing small-space sketches of distributions in the data-stream model, a concept related to dimensionality reduction, such that these measures can be approximated from the sketches. Related problems for ℓp distances are reasonably well understood via a series of results by Johnson & Lindenstrauss (1984), Alon et al. (1999), Indyk (2000), and Brinkman & Charikar (2003). In contrast, almost no analogous results are known to date about constructing sketches for the information divergences used in statistics and learning theory. Our main result is an impossibility result showing that no small-space sketches exist for the multiplicative approximation of any commonly used f-divergences and Bregman divergences, with the notable exceptions of ℓ1 and ℓ2, where small-space sketches exist. We then present data-stream algorithms for the additive approximation of a wide range of information divergences. Throughout, our emphasis is on providing general characterizations.
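The divergences in question are easy to compute exactly when the distributions are small and explicit; the sketch below (illustrative names) is the baseline quantity that streaming approximations would have to estimate from small-space summaries.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence (natural log); requires q > 0
    wherever p > 0, and uses the convention 0 * log(0/q) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def hellinger(p, q):
    """Squared Hellinger distance, an f-divergence bounded in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Note the asymmetry: KL blows up when q places no mass where p does, which is one reason multiplicative approximation from a sketch is so much harder than for ℓ1 or ℓ2.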
Analysis of semisupervised learning with the Yarowsky algorithm
 In: 23rd Conference on Uncertainty in Artificial Intelligence (UAI)
, 2007
Abstract

Cited by 13 (3 self)
The Yarowsky algorithm is a rule-based semi-supervised learning algorithm that has been successfully applied to some problems in computational linguistics. The algorithm was not mathematically well understood until Abney (2004), which analyzed some specific variants of the algorithm and also proposed some new algorithms for bootstrapping. In this paper, we extend Abney’s work and show that some of his proposed algorithms actually optimize (an upper bound on) an objective function based on a new definition of cross-entropy, which is in turn based on a particular instantiation of the Bregman distance between probability distributions. Moreover, we suggest some new algorithms for rule-based semi-supervised learning and show connections with harmonic functions and minimum multi-way cuts in graph-based semi-supervised learning.
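A schematic of the bootstrapping loop in the Yarowsky spirit: generic self-training with a confidence threshold and an illustrative soft nearest-centroid base learner. This is a sketch of the general loop, not any of Abney's exact variants, and all names here are assumptions for the example.

```python
import numpy as np

def soft_centroid_proba(Xl, yl, Xq):
    """Soft nearest-centroid classifier: softmax over negative distances
    to each class centroid (a stand-in for Yarowsky's decision rules)."""
    cents = np.array([Xl[yl == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(Xq[:, None, :] - cents[None, :, :], axis=2)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def self_train(X, y_seed, threshold=0.8, rounds=10):
    """Bootstrapping loop: fit on the labeled pool, then promote
    unlabeled points whose predicted probability clears the threshold."""
    y = y_seed.copy()                            # -1 marks unlabeled
    for _ in range(rounds):
        labeled = y != -1
        proba = soft_centroid_proba(X[labeled], y[labeled], X)
        conf, hard = proba.max(axis=1), proba.argmax(axis=1)
        newly = (~labeled) & (conf >= threshold)
        if not newly.any():
            break                                # nothing new to promote
        y[newly] = hard[newly]
    return y
```

The paper's analysis asks what objective loops like this implicitly optimize; the answer it gives is a Bregman-distance-based cross-entropy bound.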
Semisupervised learning with measure propagation
 Journal of Machine Learning. Research
, 2011
Abstract

Cited by 12 (2 self)
We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets; for example, in one instance we use a data set with over 10^8 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case.
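A simplified relative of the propagation idea can be sketched as follows. This sketch uses plain weighted averaging of neighbor distributions with clamped labels, i.e. the squared-error special case the abstract mentions at the end, not the paper's KL objective or its alternating-minimization updates; `propagate` is an illustrative name.

```python
import numpy as np

def propagate(W, seeds, n_iter=100):
    """Propagate class measures over a graph with symmetric weight
    matrix W: each unlabeled node's distribution becomes the weighted
    average of its neighbors', with labeled (seed) nodes clamped."""
    n = W.shape[0]
    k = max(seeds.values()) + 1          # number of classes
    P = np.full((n, k), 1.0 / k)         # start from uniform measures
    for i, c in seeds.items():
        P[i] = np.eye(k)[c]
    for _ in range(n_iter):
        P_new = W @ P / W.sum(axis=1, keepdims=True)
        for i, c in seeds.items():       # clamp labeled nodes each sweep
            P_new[i] = np.eye(k)[c]
        P = P_new
    return P
```

On a path graph with one seed per endpoint, this converges to the harmonic solution, with class probabilities interpolating linearly between the seeds.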