Results 1  10
of
16
Information, Divergence and Risk for Binary Experiments
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2009
"... We unify fdivergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROCcurves and statistical information. We do this by systematically studying integral and variational representations of these various objects and in so doing identify their primitives which all ..."
Abstract

Cited by 35 (8 self)
 Add to MetaCart
We unify fdivergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROCcurves and statistical information. We do this by systematically studying integral and variational representations of these various objects and in so doing identify their primitives which all are related to costsensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating fdivergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants.
Composite Multiclass Losses
"... We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity ..."
Abstract

Cited by 18 (7 self)
 Add to MetaCart
(Show Context)
We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, ordersensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration ” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
MIXABILITY IS BAYES RISK CURVATURE RELATIVE TO LOG LOSS
"... Given K codes, a standard result from source coding tells us how to design a single universal code with codelengths within log(K) bits of the best code, on any data sequence. Translated to the online learning setting of prediction with expert advice, this result implies that for logarithmic loss one ..."
Abstract

Cited by 9 (6 self)
 Add to MetaCart
Given K codes, a standard result from source coding tells us how to design a single universal code with codelengths within log(K) bits of the best code, on any data sequence. Translated to the online learning setting of prediction with expert advice, this result implies that for logarithmic loss one can guarantee constant regret, which does not grow with the number of outcomes that need to be predicted. In this setting, it is known for which other losses the same guarantee can be given: these are the losses that are mixable. We show that among the mixable losses, log loss is special: in fact, one may understand the class of mixable losses as those that behave like log loss in an essential way. More specifically, a loss is mixable if and only if the curvature of its Bayes risk is at least as large as the curvature of the Bayes risk for log loss (for which the Bayes risk equals the entropy). 1.
A unified view of performance metrics: Translating threshold choices into expected classification loss
 Journal of Machine Learning Research
, 2012
"... Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute error and the Brier score or mean squared error (with its decomposition into refinement and calibration). One way of understanding the relations among these metrics is by means of variable operating conditions (in the form of misclassification costs and/or class distributions). Thus, a metric may correspond to some expected loss over different operating conditions. One dimension for the analysis has been the distribution for this range of operating conditions, leading to some important connections in the area of proper scoring rules. We demonstrate in this paper that there is an equally important dimension which has so far received much less attention in the analysis of performance metrics. This dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: fixed, scoreuniform, scoredriven, ratedriven and optimal, among others. By calculating the expected loss obtained with these threshold choice methods for a uniform
Surrogate regret bounds for the area under the ROC curve via strongly proper losses
 In COLT
, 2013
"... The area under the ROC curve (AUC) is a widely used performance measure in machine learning, and has been widely studied in recent years particularly in the context of bipartite ranking. A dominant theoretical and algorithmic framework for AUC optimization/bipartite ranking has been to reduce the ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
The area under the ROC curve (AUC) is a widely used performance measure in machine learning, and has been widely studied in recent years particularly in the context of bipartite ranking. A dominant theoretical and algorithmic framework for AUC optimization/bipartite ranking has been to reduce the problem to pairwise classification; in particular, it is well known that the AUC regret can be formulated as a pairwise classification regret, which in turn can be upper bounded using usual regret bounds for binary classification. Recently, Kotlowski et al. (2011) showed AUC regret bounds in terms of the regret associated with ‘balanced ’ versions of the standard (nonpairwise) logistic and exponential losses. In this paper, we obtain such (nonpairwise) surrogate regret bounds for the AUC in terms of a broad class of proper (composite) losses that we term strongly proper. Our proof technique is considerably simpler than that of Kotlowski et al. (2011), and relies on properties of proper (composite) losses as elucidated recently by Reid and Williamson (2009, 2010, 2011) and others. Our result yields explicit surrogate bounds (with no hidden balancing terms) in terms of a variety of strongly proper losses, including for example logistic, exponential, squared and squared hinge losses. An important consequence is that standard algorithms minimizing a (nonpairwise) strongly proper loss, such as logistic regression and boosting algorithms (assuming a universal function class and appropriate regularization), are in fact AUCconsistent; moreover, our results allow us to quantify the AUC regret in terms of the corresponding surrogate regret. We also obtain tighter surrogate regret bounds under certain lownoise conditions via a recent result of Clémençon and Robbiano (2011).
Consistent Binary Classification with Generalized Performance Metrics
"... Performance metrics for binary classification are designed to capture tradeoffs between four fundamental population quantities: true positives, false positives, true negatives and false negatives. Despite significant interest from theoretical and applied communities, little is known about either op ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
(Show Context)
Performance metrics for binary classification are designed to capture tradeoffs between four fundamental population quantities: true positives, false positives, true negatives and false negatives. Despite significant interest from theoretical and applied communities, little is known about either optimal classifiers or consistent algorithms for optimizing binary classification performance metrics beyond a few special cases. We consider a fairly large family of performance metrics given by ratios of linear combinations of the four fundamental population quantities. This family includes many well known binary classification metrics such as classification accuracy, AM measure, Fmeasure and the Jaccard similarity coefficient as special cases. Our analysis identifies the optimal classifiers as the sign of the thresholded conditional probability of the positive class, with a performance metricdependent threshold. The optimal threshold can be constructed using simple plugin estimators when the performance metric is a linear combination of the population quantities, but alternative techniques are required for the general case. We propose two algorithms for estimating the optimal classifiers, and prove their statistical consistency. Both algorithms are straightforward modifications of standard approaches to address the key challenge of optimal threshold selection, thus are simple to implement in practice. The first algorithm combines a plugin estimate of the conditional probability of the positive class with optimal threshold selection. The second algorithm leverages recent work on calibrated asymmetric surrogate losses to construct candidate classifiers. We present empirical comparisons between these algorithms on benchmark datasets. 1
The convexity and design of composite multiclass losses
 In Proceedings of the 29th International Conference on Machine Learning (ICML12
"... We consider composite loss functions for multiclass prediction comprising a proper (i.e., Fisherconsistent) loss over probability distributions and an inverse link function. We establish conditions for their (strong) convexity and explore the implications. We also show how the separation of concerns ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
We consider composite loss functions for multiclass prediction comprising a proper (i.e., Fisherconsistent) loss over probability distributions and an inverse link function. We establish conditions for their (strong) convexity and explore the implications. We also show how the separation of concerns afforded by using this composite representation allows for the design of families of losses with the same Bayes risk. 1.
25th Annual Conference on Learning Theory Divergences and Risks for Multiclass Experiments
"... Csiszár’s fdivergence is a way to measure the similarity of two probability distributions. We study the extension of fdivergence to more than two distributions to measure their joint similarity. By exploiting classical results from the comparison of experiments literature we prove the resulting di ..."
Abstract
 Add to MetaCart
Csiszár’s fdivergence is a way to measure the similarity of two probability distributions. We study the extension of fdivergence to more than two distributions to measure their joint similarity. By exploiting classical results from the comparison of experiments literature we prove the resulting divergence satisfies all the same properties as the traditional binary one. Considering the multidistribution case actually makes the proofs simpler. The key to these results is a formal bridge between these multidistribution fdivergences and Bayes risks for multiclass classification problems.
Journal of Machine Learning Research? (2014) n–n+48 Submitted 7/14; Published? Composite Multiclass Losses
"... We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We subsume results on “classi ..."
Abstract
 Add to MetaCart
We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We subsume results on “classification calibration ” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. We determine the stationarity condition, Bregman representation, ordersensitivity, and quasiconvexity of multiclass proper losses. We then characterise the existence and uniqueness of the composite representation for multiclass losses. We show how the composite representation is related to other core properties of a loss: mixability, admissibility and (strong) convexity of multiclass losses which we characterise in terms of the Hessian of the Bayes risk.
On the Consistency of Output Code Based Learning Algorithms for Multiclass Learning Problems
"... A popular approach to solving multiclass learning problems is to reduce them to a set of binary classification problems through some output code matrix: the widely used onevsall and allpairs methods, and the errorcorrecting output code methods of Dietterich and Bakiri (1995), can all be viewed a ..."
Abstract
 Add to MetaCart
A popular approach to solving multiclass learning problems is to reduce them to a set of binary classification problems through some output code matrix: the widely used onevsall and allpairs methods, and the errorcorrecting output code methods of Dietterich and Bakiri (1995), can all be viewed as special cases of this approach. In this paper, we consider the question of statistical consistency of such methods. We focus on settings where the binary problems are solved by minimizing a binary surrogate loss, and derive general conditions on the binary surrogate loss under which the onevsall and allpairs code matrices yield consistent algorithms with respect to the multiclass 01 loss. We then consider general multiclass learning problems defined by a general multiclass loss, and derive conditions on the output code matrix and binary surrogates under which the resulting algorithm is consistent with respect to the target multiclass loss. We also consider probabilistic code matrices, where one reduces a multiclass problem to a set of class probability labeled binary problems, and show that these can yield benefits in the sense of requiring a smaller number of binary problems to achieve overall consistency. Our analysis makes interesting connections with the theory of proper composite losses (Buja et al., 2005; Reid and Williamson, 2010); these play a role