## Composite Binary Losses (2009)


Citations: 13 (9 self)

### BibTeX

@MISC{Reid09compositebinary,
  author = {Mark D. Reid and Robert C. Williamson},
  title = {Composite Binary Losses},
  year = {2009}
}

### Abstract

We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses, which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses, and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning”. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses, and show that all convex proper losses are non-robust to misclassification noise.
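The abstract's central object, a composite loss (a proper loss composed with a link function), can be illustrated with a standard example: the logistic margin loss is the composite of the proper log loss with the logit link. The sketch below is mine, not from the paper, and the function names are hypothetical.

```python
import math

def log_loss_partial(y, p):
    """Proper log loss: penalty for predicting probability p when the label is y in {-1, +1}."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def sigmoid(v):
    """Inverse of the logit link: maps a real-valued score v to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def logistic_margin_loss(y, v):
    """The logistic loss in its familiar margin form."""
    return math.log(1.0 + math.exp(-y * v))

# Composing the proper log loss with the inverse logit link reproduces
# the logistic margin loss for both labels and a range of scores.
for v in [-2.0, -0.5, 0.0, 1.3, 4.0]:
    for y in (-1, 1):
        composite = log_loss_partial(y, sigmoid(v))
        assert abs(composite - logistic_margin_loss(y, v)) < 1e-12
```

The same decomposition with a different link or a different proper loss yields a different composite loss, which is the generality the abstract refers to.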

### Citations

3623 | Variational Analysis
- Rockafellar, Wets
- 1998
Citation Context: ...(s)}. (55) The LF dual of any function is convex. When φ(s) is a function of a real argument s and the derivative φ′(s) exists, the Legendre-Fenchel conjugate φ⋆ is given by the Legendre transform [40, 21]: φ⋆(s) = s · (φ′)⁻¹(s) − φ((φ′)⁻¹(s)). (56) Thus (writing ∂f := f′) f′ = (∂f⋆)⁻¹. Thus with w, W, and W defined as above, W = (∂(W⋆))⁻¹, W⁻¹ = ∂(W⋆), W⋆ = ∫ W⁻¹. (57) L...
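The Legendre-transform formula in equation (56) of the excerpt can be sanity-checked on a simple self-conjugate example, φ(s) = s²/2 (my choice, not from the paper), against the sup definition in (55):

```python
# For the differentiable convex function phi(s) = s^2/2, the Legendre
# transform formula gives phi*(s) = s * s - s^2/2 = s^2/2, i.e. phi is
# self-conjugate.  Cross-check against the sup definition
# phi*(s) = sup_t { s*t - phi(t) } evaluated on a grid.

def phi(t):
    return 0.5 * t * t

def phi_star_formula(s):
    # (phi')^{-1}(s) = s, since phi'(t) = t
    t = s
    return s * t - phi(t)

def phi_star_sup(s, grid):
    return max(s * t - phi(t) for t in grid)

grid = [i / 100.0 for i in range(-500, 501)]  # covers [-5, 5]
for s in [-2.0, -0.7, 0.0, 1.0, 3.0]:
    assert abs(phi_star_formula(s) - 0.5 * s * s) < 1e-12
    assert abs(phi_star_sup(s, grid) - phi_star_formula(s)) < 1e-3
```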

2337 | Support-vector Networks
- Cortes, Vapnik
- 1995
Citation Context: ...label is y is necessarily the same as the penalty for predicting −v when the label is −y. Margin losses have attracted a lot of attention [5] because of their central role in Support Vector Machines [11]. In this section we explore the relationship between these margin losses and the more general class of composite losses and, in particular, symmetric composite losses. Recall that a general composite...

2043 | Robust Statistics
- Huber
- 1981
Citation Context: ...io [31] examine is akin to that studied for instance by Kearns [25]. There are many other meanings of “robust” which are different to that which we consider. The classical notion of robust statistics [22] is motivated by robustness to contamination of additive observation noise (some heavy-tail noise mixed in with the Gaussian noise often assumed in designing estimators). There are some results about...

1734 | Generalised linear models - Nelder, Wedderburn - 1972

344 | New support vector algorithms
- Scholkopf, Smola, et al.
Citation Context: ...on noise (some heavy-tail noise mixed in with the Gaussian noise often assumed in designing estimators). There are some results about particular machine learning algorithms being robust in that sense [43]. “Robust” is also used to mean robustness with respect to random attribute noise [48], robustness to unknown prior class probabilities [37], or a Huber-style robustness to attribute noise (“outliers”)...

319 | Clustering with Bregman divergences - Banerjee, Merugu, et al. - 2005

301 | Efficient noise-tolerant learning from statistical queries - Kearns - 1998

284 | The Foundations of Cost-Sensitive Learning
- Elkan
Citation Context: ...s focussed on margin losses which intrinsically treat positive and negative classes symmetrically. However it is now well understood how important it is to be able to deal with the non-symmetric case [2, 12, 8, 9, 37]. A key goal of the present work is to consider composite losses in the general (non-symmetric) situation. Having the flexibility to choose a loss function is important in order to “tailor” the soluti...

270 | Robust Classification for Imprecise Environments
- Provost, Fawcett
- 2001
Citation Context: ...s focussed on margin losses which intrinsically treat positive and negative classes symmetrically. However it is now well understood how important it is to be able to deal with the non-symmetric case [2, 12, 8, 9, 37]. A key goal of the present work is to consider composite losses in the general (non-symmetric) situation. Having the flexibility to choose a loss function is important in order to “tailor” the soluti...

174 | Strictly proper scoring rules, prediction, and estimation
- Gneiting, Raftery
- 2007
Citation Context: ...can be interpreted as probabilities. Having such probabilities is often important in applications, and there has been considerable interest in understanding how to get accurate probability estimates [36, 16, 10] and understanding the implications of requiring loss functions provide good probability estimates [6]. Much previous work in the machine learning literature has focussed on margin losses which intrin...

145 | Elicitation of personal probabilities and expectations
- Savage
- 1971
Citation Context: ...5 proper loss with weight function w defined as in equation (6). Then for all c ∈ (0, 1) its conditional Bayes risk L satisfies w(c) = −L″(c). (8) 4 This is equivalent to the conditions of Savage [41] and Schervish [42]. 5 The restriction to differentiable losses can be removed in most cases if generalised weight functions — that is, possibly infinite but defining a measure on (0, 1) — are permitted...
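Relation (8), w(c) = −L″(c), can be verified numerically for the log loss, whose conditional Bayes risk is the binary entropy and whose weight function is the well-known w(c) = 1/(c(1 − c)). The helper names below are mine:

```python
import math

def bayes_risk_log_loss(c):
    """Conditional Bayes risk of the log loss: the binary entropy of c."""
    return -c * math.log(c) - (1.0 - c) * math.log(1.0 - c)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# w(c) = -L''(c): the curvature of the Bayes risk recovers the weight function.
for c in [0.1, 0.25, 0.5, 0.8]:
    w = 1.0 / (c * (1.0 - c))  # standard weight function of the log loss
    assert abs(-second_derivative(bayes_risk_log_loss, c) - w) < 1e-3
```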

141 | Sparseness of support vector machines
- Steinwart
Citation Context: ...to work with algorithmically. Convex surrogate losses are often used in place of the 0-1 loss, which is not convex. Surrogate losses have garnered increasing interest in the machine learning community [50, 7, 46, 47]. Some of the questions considered to date are bounding the regret of a desired loss in terms of a surrogate (“surrogate regret bounds” — see [39] and references therein), the relationship between the...

124 | Fundamentals of Convex Analysis - Hiriart-Urruty, Lemaréchal - 2001

120 | Convexity, classification, and risk bounds
- Bartlett, Jordan, et al.
- 2005
Citation Context: ...ionship between regret and Bregman divergences for general composite losses. In §5 we characterise the relationship between classification calibrated losses (as studied for example by Bartlett et al. [7]) and proper composite losses. In §6, motivated by the question of which is the best surrogate loss, we characterise when a proper composite loss is convex in terms of the natural parametrisation of s...

114 | Probabilities for SV machines
- Platt
- 2000
Citation Context: ...can be interpreted as probabilities. Having such probabilities is often important in applications, and there has been considerable interest in understanding how to get accurate probability estimates [36, 16, 10] and understanding the implications of requiring loss functions provide good probability estimates [6]. Much previous work in the machine learning literature has focussed on margin losses which intrin...

85 | Optimal prediction under asymmetric loss. Econometric Theory - Christoffersen, Diebold - 1997

81 | Two Notes on Notation
- Knuth
- 1992
Citation Context: ...observation-conditional density, taking the M-average of the point-wise risk gives the (full) risk of the estimator v, now interpreted as 1 This is the Iverson bracket notation as recommended by Knuth [27]. 2 Restricting the output of a loss to [0, ∞) is equivalent to assuming the loss has a lower bound and then translating its output. 3 These are known as scoring rules in the statistical literature [1...
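The Iverson bracket mentioned in the footnote, [P] = 1 if the predicate P holds and 0 otherwise, maps directly onto a boolean-to-integer conversion. As a generic illustration (not the paper's own definitions), the 0-1 misclassification loss can be written with it:

```python
def iverson(p):
    """Iverson bracket: [P] = 1 if the predicate P holds, else 0."""
    return 1 if p else 0

def zero_one_loss(y, y_hat):
    # The 0-1 misclassification loss expressed with an Iverson bracket.
    return iverson(y != y_hat)

assert zero_one_loss(1, 1) == 0
assert zero_one_loss(1, -1) == 1
assert zero_one_loss(-1, -1) == 0
```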

80 | Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory
- Grünwald, Dawid
- 2004
Citation Context: ...η) := inf_{v∈V} L(η, v) is the point-wise or conditional Bayes risk. There has been increasing awareness of the importance of the conditional Bayes risk curve L(η) — also known as “generalized entropy” [17] — in the analysis of losses for probability estimation [23, 24, 1, 32]. Below we will see how it is effectively the curvature of L that determines much of the structure of these losses. 3 Losses for...

77 | Relative Loss Bounds for Multidimensional Regression Problems
- Kivinen, Warmuth
Citation Context: .... [9] introduced the notion of a canonical link defined by ψ′(v) = w(v). The canonical link corresponds to the notion of “matching loss” as developed by Helmbold et al. [20] and Kivinen and Warmuth [26]. Note that choice of canonical link implies ρ(c) = w(c)/ψ′(c) = 1. Lemma 27 Suppose ℓ is a proper loss with weight function w and ψ is the corresponding canonical link, then Φψ(x) = −w′(x)... (43)...

52 | Bayesian Estimation and Prediction Using Asymmetric Loss Functions - Zellner - 1986

37 | Relative Loss Bounds for Single Neurons
- Helmbold, Kivinen, et al.
- 1999
Citation Context: ...∀x ∈ (0, 1). Buja et al. [9] introduced the notion of a canonical link defined by ψ′(v) = w(v). The canonical link corresponds to the notion of “matching loss” as developed by Helmbold et al. [20] and Kivinen and Warmuth [26]. Note that choice of canonical link implies ρ(c) = w(c)/ψ′(c) = 1. Lemma 27 Suppose ℓ is a proper loss with weight function w and ψ is the corresponding canonical link,...

36 | Loss functions for binary class probability estimation: structure and applications
- Buja, Stuetzle, et al.
- 2005
Citation Context: ...s focussed on margin losses which intrinsically treat positive and negative classes symmetrically. However it is now well understood how important it is to be able to deal with the non-symmetric case [2, 12, 8, 9, 37]. A key goal of the present work is to consider composite losses in the general (non-symmetric) situation. Having the flexibility to choose a loss function is important in order to “tailor” the soluti...

36 | A general method for comparing probability assessors
- Schervish
- 1989
Citation Context: ...weight function w defined as in equation (6). Then for all c ∈ (0, 1) its conditional Bayes risk L satisfies w(c) = −L″(c). (8) 4 This is equivalent to the conditions of Savage [41] and Schervish [42]. 5 The restriction to differentiable losses can be removed in most cases if generalised weight functions — that is, possibly infinite but defining a measure on (0, 1) — are permitted. For example, the w...

31 | Combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling
- Fidler, Skocaj, et al.
- 2006
Citation Context: ...sed to mean robustness with respect to random attribute noise [48], robustness to unknown prior class probabilities [37], or a Huber-style robustness to attribute noise (“outliers”) for classification [13]. We only study robustness in the sense of random label noise. ...susceptible to random class noise. In particular they present a very simple learning task which is “boostable” – can be perfectly solv...

30 | Deconstructing statistical questions
- Hand
- 1994
Citation Context: ...nsider composite losses in the general (non-symmetric) situation. Having the flexibility to choose a loss function is important in order to “tailor” the solution to a machine learning problem; confer [18, 19, 9]. Understanding the structure of the set of loss functions and having natural parametrisations of them is useful for this purpose. Even when one is using a loss as a surrogate for the loss one would i...

30 | A note on margin-based loss functions in classification
- Lin
- 1999
Citation Context: ...opposite side of c to η, or η̂ = c. The condition CC_{1/2} is equivalent to what is called “classification calibrated” by Bartlett et al. [7] and “Fisher consistent for classification problems” by Lin [30], although their definitions were only for margin losses. One might suspect that there is a connection between classification calibrated at c and standard Fisher consistency for class probability estim...

28 | Considering cost asymmetry in learning classifiers - Bach, Heckerman, et al.

27 | A stochastic view of optimal regret through minimax duality
- Abernethy, Agarwal, et al.
- 2009
Citation Context: ...binary classification label, but providing an estimate of the probability that an example will have a positive label. Link functions are often used to map the outputs of a predictor to the interval [0, 1] so that they can be interpreted as probabilities. Having such probabilities is often important in applications, and there has been considerable interest in understanding how to get accurate probabili...

27 | Sparseness vs estimating conditional probabilities: Some asymptotic results
- Bartlett, Tewari
- 2007
Citation Context: ...as been considerable interest in understanding how to get accurate probability estimates [36, 16, 10] and understanding the implications of requiring loss functions provide good probability estimates [6]. Much previous work in the machine learning literature has focussed on margin losses which intrinsically treat positive and negative classes symmetrically. However it is now well understood how impor...

27 | Properties and benefits of calibrated classifiers - Cohen, Goldszmidt - 2004

22 | Eliciting properties of probability distributions
- Lambert, Pennock, et al.
- 2008
Citation Context: ...ccording to w(c)), and then choose the link somewhat arbitrarily to map the hypotheses appropriately. An interesting alternative perspective arises in the literature on “elicitability”. Lambert et al. [28] provide a general characterisation of proper scoring rules (i.e. losses) for general properties of distributions, that is, continuous and locally nonconstant functions Γ which assign a real value t...

19 | Divergence function, duality, and convex analysis
- Zhang
- 2004
Citation Context: ...of w. Note that the standard parametrisation for a Bregman divergence is in terms of the convex function W. Thus we will write D_W, D_W and D_w to all represent (58). The following theorem is known (e.g. [49]) but as will be seen, stating it in terms of D_W provides some advantages. Theorem 34 Let w, W, W and D_W be as above. Then for all x, y ∈ [0, 1], ... Proof Using (56) we have ... Equivalently (using (57)) D_W...

18 | Integral Inequalities and Applications
- Bainov, Simeonov
- 1992
Citation Context: ...ns. Corollary 30 If a loss is proper and convex, then it is strictly proper. The proof of Corollary 30 makes use of the following special case of the Gronwall-style Lemma 1.1.1 of Bainov and Simeonov [3]. Lemma 31 Let b: R → R be continuous for t ≥ α. Let v(t) be differentiable for t ≥ α and suppose v′(t) ≤ b(t)v(t) for t ≥ α and v(α) ≤ v₀. Then for t ≥ α, v(t) ≤ v₀ exp(∫_α^t b(s) ds). Proof (...
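The Gronwall-style bound of Lemma 31 can be exercised numerically: integrate any v with v′(t) ≤ b(t)v(t) and compare it with v(α)·exp(∫_α^t b(s) ds). The coefficient b and the forcing term below are arbitrary choices of mine, not from the paper:

```python
import math

def b(t):
    """An arbitrary continuous coefficient with a closed-form integral."""
    return 1.0 + 0.5 * math.sin(t)

# Forward-Euler solution of v'(t) = b(t) v(t) - 0.1, so that v' <= b v
# holds pointwise, with v(0) = 1.  The Gronwall bound says
# v(t) <= v(0) * exp(integral of b from 0 to t).
h, v, t = 1e-4, 1.0, 0.0
while t < 5.0:
    v += h * (b(t) * v - 0.1)
    t += h
    bound = math.exp(t + 0.5 * (1.0 - math.cos(t)))  # exp of the exact integral of b
    assert v <= bound
```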

18 | Estimating class membership probabilities using classifier learners
- Langford, Zadrozny
- 2005
Citation Context: ...tion between classification calibrated at c and standard Fisher consistency for class probability estimation losses. The following theorem, which captures the intuition behind the “probing” reduction [29], characterises the situation. Theorem 16 A CPE loss ℓ is CCc for all c ∈ (0, 1) if and only if ℓ is strictly proper. Proof: L is CCc for all c ∈ (0, 1) is equivalent to { L(η) < inf_{η̂≥c} L(η, η̂), η <...

18 | Random classification noise defeats all convex potential boosters
- Long, Servedio
- 2008
Citation Context: ...convexity of proper losses (Theorem 29) allows one to make general algorithm-independent statements about the robustness of convex proper losses to random misclassification noise. Long and Servedio [31] have shown that boosting with convex potential functions (i.e., convex margin losses) is not robust to random class noise.¹⁰ That is, they are... (¹⁰ We define exactly what we mean by robustness be...)

17 | Fundamentals of Convex Analysis (Grundlehren Text Editions)
- Hiriart-Urruty, Lemaréchal
- 2001
Citation Context: ...(s)}. (55) The LF dual of any function is convex. When φ(s) is a function of a real argument s and the derivative φ′(s) exists, the Legendre-Fenchel conjugate φ⋆ is given by the Legendre transform [40, 21]: φ⋆(s) = s · (φ′)⁻¹(s) − φ((φ′)⁻¹(s)). (56) Thus (writing ∂f := f′) f′ = (∂f⋆)⁻¹. Thus with w, W, and W defined as above, W = (∂(W⋆))⁻¹, W⁻¹ = ∂(W⋆), W⋆ = ∫ W⁻¹. (57) L...

17 | How to compare different loss functions and their risks
- Steinwart
Citation Context: ...to work with algorithmically. Convex surrogate losses are often used in place of the 0-1 loss, which is not convex. Surrogate losses have garnered increasing interest in the machine learning community [50, 7, 46, 47]. Some of the questions considered to date are bounding the regret of a desired loss in terms of a surrogate (“surrogate regret bounds” — see [39] and references therein), the relationship between the...

16 | Local versus global models for classification problems: fitting models where it matters. The American Statistician 57(2)
- Hand, Vinciotti
- 2003
Citation Context: ...nsider composite losses in the general (non-symmetric) situation. Having the flexibility to choose a loss function is important in order to “tailor” the solution to a machine learning problem; confer [18, 19, 9]. Understanding the structure of the set of loss functions and having natural parametrisations of them is useful for this purpose. Even when one is using a loss as a surrogate for the loss one would i...

16 | Admissible probability measurement procedures
- Shuford, Albert, et al.
- 1966
Citation Context: ...inder of this paper will involve losses which are proper, fair, definite and regular. 3.2 The Structure of Proper Losses A key result in the study of proper losses is originally due to Shuford et al. [45], though our presentation follows that of Buja et al. [9]. It characterises proper losses for probability estimation via a constraint on the relationship between its partial losses. Theorem 1 Suppose ℓ...

13 | On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost
- Masnadi-Shirazi, Vasconcelos
- 2008
Citation Context: ...risk. There has been increasing awareness of the importance of the conditional Bayes risk curve L(η) — also known as “generalized entropy” [17] — in the analysis of losses for probability estimation [23, 24, 1, 32]. Below we will see how it is effectively the curvature of L that determines much of the structure of these losses. 3 Losses for Class Probability Estimation We begin by considering CPE losses, that i...

13 | Statistical behaviour and consistency of classification methods based on convex risk minimization
- Zhang
Citation Context: ...ts link. Corollary 13 Let ℓψ be a proper composite loss with invertible link. Then for all η, η̂ ∈ (0, 1), ∆Lψ(η, v) = D−L(η, ψ⁻¹(v)). (25) This corollary generalises the results due to Zhang [50] and Masnadi-Shirazi and Vasconcelos [32] who considered only margin losses, respectively without and with links. 4.2 Margin Losses The margin associated with a real-valued prediction v ∈ R and label y...
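Corollary 13's identity ΔLψ(η, v) = D₋L(η, ψ⁻¹(v)) specialises, for the log loss with the identity link, to the familiar fact that log-loss regret equals the binary KL divergence, which is the Bregman divergence generated by the negative entropy. A numerical check of that special case (function names mine):

```python
import math

def point_loss(eta, eta_hat):
    """Conditional risk of the proper log loss: true positive-class
    probability eta, predicted probability eta_hat."""
    return -eta * math.log(eta_hat) - (1 - eta) * math.log(1 - eta_hat)

def neg_entropy(x):
    """Negative binary entropy, a convex function on (0, 1)."""
    return x * math.log(x) + (1 - x) * math.log(1 - x)

def bregman_neg_entropy(x, y):
    # D_f(x, y) = f(x) - f(y) - f'(y)(x - y), with f = negative binary entropy;
    # f'(y) is the logit of y.  This equals the binary KL divergence KL(x || y).
    fprime = math.log(y / (1 - y))
    return neg_entropy(x) - neg_entropy(y) - fprime * (x - y)

for eta in [0.2, 0.5, 0.9]:
    for eta_hat in [0.1, 0.4, 0.7]:
        regret = point_loss(eta, eta_hat) - point_loss(eta, eta)
        assert abs(regret - bregman_neg_entropy(eta, eta_hat)) < 1e-12
```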

8 | Bregman divergences and surrogates for learning
- Nock, Nielsen
- 2009
Citation Context: ...39] and references therein), the relationship between the decision-theoretic perspective and the elicitability perspective [32], and efficient algorithms for minimising convex surrogate margin losses [35, 34]. Typically convex surrogates are used because they lead to convex, and thus tractable, optimisation problems. To date, work on surrogate losses has focussed on margin losses which necessarily are sym...

7 | Machine learning techniques: reductions between prediction quality metrics - Beygelzimer, Langford, et al. - 2008

7 | Surrogate regret bounds for proper losses - Reid, Williamson - 2009

6 | Information, divergence and risk for binary experiments. arXiv preprint arXiv:0901.0356v1
- Reid, Williamson
- 2009
Citation Context: ...ght function is captured succinctly via the following representation of proper losses as a weighted integral of the cost-weighted misclassification losses ℓc defined in (2). The reader is referred to [39] for the details, proof and the history of this result. 6 A concise summary of Bregman divergences and their properties is given by Banerjee et al. [4, Appendix A]. ...Figure 1: The structure of the co...
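The weighted-integral representation referenced above can be checked for the log loss: with w(c) = 1/(c(1 − c)), the partial losses come out as ℓ₁(p̂) = ∫_{p̂}¹ (1 − c)w(c) dc = −log p̂ and ℓ₋₁(p̂) = ∫₀^{p̂} c·w(c) dc = −log(1 − p̂). The normalisation used here is one common convention, not necessarily the exact statement in [39], and the quadrature helper is mine:

```python
import math

def w(c):
    """Weight function of the log loss: w(c) = 1/(c(1 - c))."""
    return 1.0 / (c * (1.0 - c))

def integrate(f, lo, hi, n=100000):
    """Simple midpoint-rule quadrature."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

# The weighted integrals of the cost-weighted losses reproduce the
# partial log losses -log(p) and -log(1 - p).
for p in [0.2, 0.5, 0.8]:
    ell_pos = integrate(lambda c: (1.0 - c) * w(c), p, 1.0)   # integrand simplifies to 1/c
    ell_neg = integrate(lambda c: c * w(c), 0.0, p)           # integrand simplifies to 1/(1-c)
    assert abs(ell_pos - (-math.log(p))) < 1e-4
    assert abs(ell_neg - (-math.log(1.0 - p))) < 1e-4
```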

6 |
Robust classification and regression using support vector machines
- Trafalis, Gilbert
- 2006
Citation Context: ...gning estimators). There are some results about particular machine learning algorithms being robust in that sense [43]. “Robust” is also used to mean robustness with respect to random attribute noise [48], robustness to unknown prior class probabilities [37], or a Huber-style robustness to attribute noise (“outliers”) for classification [13]. We only study robustness in the sense of random label noise...

4 | Coherence functions for multicategory margin-based classification methods
- Zhang, Jordan, et al.
- 2009
Citation Context: ...es parametrised by α ∈ (0, ∞): φα(v) = α log(exp((1 − v)/α) + 1). This family of differentiable convex losses approximates the hinge loss as α → 0 and was studied in the multiclass case by Zhang et al. [51]. Since these are all differentiable functions with φ′α(v) = −e^{(1−v)/α}/(e^{(1−v)/α} + 1), Corollary 14 and a little algebra gives ψ⁻¹(v) = ... Examining this family of in...

4 | Joint and Separate Convexity of Bregman Distance - Bauschke, Borwein - 2001

3 | On the efficient minimization of classification calibrated surrogates
- Nock, Nielsen
Citation Context: ...39] and references therein), the relationship between the decision-theoretic perspective and the elicitability perspective [32], and efficient algorithms for minimising convex surrogate margin losses [35, 34]. Typically convex surrogates are used because they lead to convex, and thus tractable, optimisation problems. To date, work on surrogate losses has focussed on margin losses which necessarily are sym...

3 | Loss Functions for Binary Classification and Class Probability Estimation - Shen - 2005 |