## Generalization bounds for the area under the ROC curve

Venue: Journal of Machine Learning Research

Citations: 48 (6 self)

### BibTeX

@ARTICLE{Agarwal_generalizationbounds,
    author = {Shivani Agarwal and Thore Graepel and Ralf Herbrich and Sariel Har-Peled and Dan Roth},
    title = {Generalization bounds for the area under the ROC curve},
    journal = {Journal of Machine Learning Research},
    year = {2005}
}

### Abstract

We study generalization properties of the area under an ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for bipartite ranking problems. The AUC is a different and more complex term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define a precise notion of the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard shatter coefficients (also known variously as the counting numbers or growth function) in uniform convergence results for the classification error rate. We also compare our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC; as we show, the bound provided by our result is considerably tighter.
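To make the quantity in the abstract concrete, the following is a minimal sketch of the empirical AUC of a ranking function on a finite labeled sequence, computed as the fraction of positive-negative pairs that are ranked correctly. The function name `empirical_auc` and the convention of counting a tie as half a correct pair are assumptions made for illustration; they are not taken from this page.

```python
def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly by the scores.

    Assumption for illustration: a label of +1 marks a positive instance,
    anything else is negative, and a tied pair counts as 1/2.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")

    correct = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                correct += 1.0   # positive ranked above negative
            elif sp == sn:
                correct += 0.5   # tie (assumed convention)
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    # Three of the four positive-negative pairs are ordered correctly.
    print(empirical_auc([0.9, 0.4, 0.6, 0.1], [1, 1, -1, -1]))
```

A value of one means every positive instance is scored above every negative instance on the given sequence; a value of zero means the reverse.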

### Citations

9061 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
Citation Context ...rected cycles, and therefore there exists a complete order on the vertices of $G'(B)$ that is consistent with the partial order defined by the edges of $G'(B)$ (topological sorting; see, for example, Cormen et al., 2001, Section 22.4). This implies a unique order on the vertices of $G(B)$ (in which vertices connected by undirected edges are assigned the same position in the ordering). For any $x \in X^m$, $x' \in X^n$, ide...
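The topological-sorting step mentioned in this context can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The helper name `consistent_order` and the edge convention "(u, v) means u precedes v" are assumptions for illustration; the graph $G'(B)$ itself is not reconstructed here.

```python
from graphlib import TopologicalSorter


def consistent_order(nodes, edges):
    """Return a complete order of `nodes` consistent with the directed edges.

    Assumption for illustration: an edge (u, v) means u must come before v.
    graphlib raises CycleError if the graph contains a directed cycle.
    """
    predecessors = {v: set() for v in nodes}
    for u, v in edges:
        predecessors[v].add(u)
    return list(TopologicalSorter(predecessors).static_order())


if __name__ == "__main__":
    order = consistent_order(["a", "b", "c", "d"],
                             [("a", "b"), ("b", "c"), ("a", "d")])
    print(order)  # one valid order; several may exist
```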

3389 | An Introduction to the Bootstrap
- Efron, Tibshirani
- 1993
Citation Context ...AUC. For example, one may attempt to estimate the quantities $p_1$, $p_2$ and $A(f)$ that appear in the expression in Eq. (7) directly from the data, or one may use resampling methods such as the bootstrap (Efron and Tibshirani, 1993), in which the variance is estimated from the sample variance observed over a number of bootstrap samples obtained from the data. The confidence intervals obtained using such estimates are only appro...
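The bootstrap variance estimate described in this context can be sketched as follows. This is an illustrative sketch only: resampling the positive and negative scores separately (so that the class sizes stay fixed), the default of 200 resamples, and the names `auc` and `bootstrap_auc_variance` are all assumptions, not details given on this page.

```python
import random
import statistics


def auc(pos, neg):
    """Empirical AUC from positive and negative scores (ties count 1/2)."""
    total = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def bootstrap_auc_variance(pos, neg, n_boot=200, seed=0):
    """Sample variance of the AUC over bootstrap resamples of the scores.

    Assumption: positives and negatives are resampled with replacement
    within their own class, keeping the class sizes m and n fixed.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        p = [rng.choice(pos) for _ in pos]
        n = [rng.choice(neg) for _ in neg]
        estimates.append(auc(p, n))
    return statistics.variance(estimates)


if __name__ == "__main__":
    pos_scores = [0.9, 0.7, 0.6, 0.4]
    neg_scores = [0.8, 0.5, 0.3, 0.2]
    print(bootstrap_auc_variance(pos_scores, neg_scores))
```

As the quoted passage notes, intervals built from such an estimate are approximate, in contrast to the distribution-free bounds the paper derives.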

1573 | Probability inequalities for sums of bounded random variables
- Hoeffding
- 1963
Citation Context ...s of these coefficients. 4.2 Uniform Convergence Bound. We first recall some classical inequalities that will be used in deriving our result, namely, Chebyshev’s inequality and Hoeffding’s inequality (Hoeffding, 1963): Theorem 4 (Chebyshev’s inequality). Let $X$ be a random variable. Then for any $\epsilon > 0$, $\mathbf{P}\{|X - \mathbf{E}\{X\}| \ge \epsilon\} \le \mathrm{Var}\{X\} / \epsilon^2$. Theorem 5 (Hoeffding, 1963). Let $X_1, \ldots, X_N$ be independent bounded random v...

976 | On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications
- Vapnik, Chervonenkis
- 1971
Citation Context ... We also recall the following standard result due to Vapnik and Chervonenkis (1971) and Sauer (1972) which provides an upper bound on the shatter coefficients in terms of the VC dimension: Theorem 8 (Vapnik and Chervonenkis, 1971; Sauer, 1972). Let $H$ be a class of binary-valued functions on $X$, with VC dimension $V_H$. Then for all $N \ge 2V_H$, $s(H, N) \le \sum_{i=0}^{V_H} \binom{N}{i} \le \left(\frac{eN}{V_H}\right)^{V_H}$. Next, we define a series of classification func...
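As a numerical illustration of the bound quoted in Theorem 8, the sketch below evaluates both the binomial sum and the closed-form bound $(eN/V_H)^{V_H}$ for given $N$ and VC dimension. The function names are hypothetical and the check is only an illustration for chosen values, not a proof.

```python
from math import comb, e


def binomial_sum_bound(N, V):
    """Sum_{i=0}^{V} C(N, i): the first upper bound on the shatter coefficient."""
    return sum(comb(N, i) for i in range(V + 1))


def closed_form_bound(N, V):
    """(e N / V)^V: the looser closed-form bound; requires V >= 1."""
    return (e * N / V) ** V


if __name__ == "__main__":
    for N, V in [(10, 2), (100, 5), (1000, 10)]:
        # Each pair satisfies N >= 2V, as required in the quoted theorem.
        print(N, V, binomial_sum_bound(N, V), closed_form_bound(N, V))
```

Both bounds grow polynomially in $N$ for fixed VC dimension, which is what makes the resulting uniform convergence bounds non-trivial.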

835 | Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979
Citation Context ...er $X \times Y$, let the expected error rate of $h$ be denoted by $L^*(h)$ and defined as $L^*(h) = \mathbf{E}_{XY \sim D}\left\{ I_{\{h(X) \ne 0\}} I_{\{h(X) \ne Y\}} + \tfrac{1}{2} I_{\{h(X) = 0\}} \right\}$. (26) Then, following the proof of a similar result given in [Vap82] for binary-valued functions, it can be shown that if $H$ is a class of $Y^*$-valued functions on $X$ and $M \in \mathbb{N}$, then for any $\epsilon > 0$, $\mathbf{P}_{S \sim D^M}\left\{ \sup_{h \in H} \left| \hat{L}^*(h; S) - \cdots \right| \ge \epsilon \right\} \le 6\, s(H, 2M)\, e^{-M\epsilon^2/4}$. (27) ...

618 | The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36 - Hanley, McNeil - 1982

562 | An efficient boosting algorithm for combining preferences - Freund, Iyer, et al. - 2003

406 | On the method of bounded differences
- McDiarmid
- 1989
Citation Context ...l be the following powerful concentration inequality of McDiarmid (1989), which bounds the deviation of any function of a sample for which a single change in the sample has limited effect: Theorem 3 (McDiarmid, 1989). Let $X_1, \ldots, X_N$ be independent random variables with $X_k$ taking values in a set $A_k$ for each $k$. Let $\phi : (A_1 \times \cdots \times A_N) \to \mathbb{R}$ be such that $\sup_{x_i \in A_i,\, x'_k \in A_k} |\phi(x_1, \ldots, x_N) - \phi(x_1, \ldots$... Then for any $\epsilon > 0$, ...

348 | Learning to order things
- Cohen, Schapire, et al.
- 1999
Citation Context ...s a ranking of the documents such that relevant documents are ranked higher than irrelevant documents. The problem of ranking has been studied from a learning perspective under a variety of settings (Cohen et al., 1999; Herbrich et al., 2000; Crammer and Singer, 2002; Freund et al., 2003). Here we consider the setting in which objects belong to one of two categories, positive and negative; the learner is given exam...

318 | Large margin rank boundaries for ordinal regression
- Herbrich, Graepel, et al.
- 2000
Citation Context ...ocuments such that relevant documents are ranked higher than irrelevant documents. The problem of ranking has been studied from a learning perspective under a variety of settings (Cohen et al., 1999; Herbrich et al., 2000; Crammer and Singer, 2002; Freund et al., 2003). Here we consider the setting in which objects come from two categories, positive and negative; the learner is given examples of objects labeled as pos...

294 | Nonparametrics: Statistical Methods Based on Ranks. Upper Saddle River
- Lehmann
- 1998
Citation Context ... label sequence of length $N \in \mathbb{N}$. Let $m$ be the number of positive labels in $y$, and $n = N - m$ the number of negative labels in $y$. Then the variance of the AUC of $f$ is given by the following expression (Lehmann, 1975): $\sigma_A^2 = \mathrm{Var}_{T_X \mid T_Y = y}\{\hat{A}(f; T)\} = \frac{A(f)(1 - A(f)) + (m - 1)(p_1 - A(f)^2) + (n - 1)(p_2 - A(f)^2)}{mn}$, (7) where $p_1 = \mathbf{P}_{X_1^+, X_2^+ \sim D_{+1},\, X_1^- \sim D_{-1}}\{\cdots\}$ and $p_2 = \mathbf{P}_{X_1^+ \sim D_{+1},\, X_1^-, X_2^- \sim D_{-1}}\{\cdots\}$. Next we re...

262 | Signal Detection Theory and ROC Analysis
- Egan
- 1975
Citation Context ...ng well-suited for evaluating ranking functions relates to receiver operating characteristic (ROC) curves. ROC curves were originally developed in signal detection theory for analysis of radar images (Egan, 1975), and have been used extensively in various fields such as medical decision-making. Given a ranking function $f : X \to \mathbb{R}$ and a finite data sequence $T = ((x_1, y_1), \ldots, (x_N, y_N)) \in (X \times Y)^N$, the ROC curve...

247 | On the density of families of sets
- Sauer
- 1972
Citation Context ...tandard result due to Vapnik and Chervonenkis (1971) and Sauer (1972) which provides an upper bound on the shatter coefficients in terms of the VC dimension: Theorem 8 (Vapnik and Chervonenkis, 1971; Sauer, 1972). Let $H$ be a class of binary-valued functions on $X$, with VC dimension $V_H$. Then for all $N \ge 2V_H$, $s(H, N) \le \sum_{i=0}^{V_H} \binom{N}{i} \le \left(\frac{eN}{V_H}\right)^{V_H}$. Next, we define a series of classification function classes ...

132 | Lectures on discrete geometry - Matoušek - 2002

110 | AUC optimization vs. error rate minimization
- Cortes, Mohri
- 2004
Citation Context ...ints such that the resulting curve is monotonically increasing. It is the area under the ROC curve (AUC) that has been used as an indicator of the quality of the ranking function $f$ (Yan et al., 2003; Cortes and Mohri, 2004). An AUC value of one corresponds to a perfect ranking on the given data sequence (i.e., all positive instances in $T$ are ranked higher than all negative instances); a value of zero corresponds to the...

56 | Relating data compression and learnability
- Littlestone, Warmuth
- 1986
Citation Context ...ization bounds for the AUC can be derived using different proof techniques. Possible routes for deriving alternative bounds for the AUC could include the theory of compression bounds (Littlestone and Warmuth, 1986; Graepel et al., 2005). Acknowledgments. We would like to thank the anonymous reviewers for many useful suggestions and for pointing us to the statis...

37 | Exponential inequalities in nonparametric estimation
- Devroye
- 1991
Citation Context ... Proof of Theorem 17. We shall need the following result of Devroye (1991), which bounds the variance of any function of a sample for which a single change in the sample has limited effect: Theorem 28 (Devroye, 1991; Devroye et al., 1996, Theorem 9.3). Let $X_1, \ldots, X_N$ be independent random variables with $X_k$ taking values in a set $A_k$ for each $k$. Let $\phi : (A_1 \times \cdots \times A_N) \to \mathbb{R}$ be such that $\sup_{x_i \in A_i,\, x'_k \in A_k} |\phi(x_1, \ldots$... Then ...

29 | Partitioning of space
- Buck
- 1943
Citation Context ...er of sign patterns $(\tilde{f}(x_1, x'_1), \ldots, \tilde{f}(x_N, x'_N))$ that can be realized by functions $\tilde{f} \in \tilde{F}_{\mathrm{lin}(d)}$ is equal to the total number of faces of this arrangement (Matoušek, 2002), which is at most (Buck, 1943) $\sum_{k=0}^{d} \sum_{i=d-k}^{d} \binom{i}{d-k} \binom{N}{i} = \sum_{i=0}^{d} 2^i \binom{N}{i} \le \left( 2eN \cdots \right)$. Since the $N$ points were arbitrary, the result follows. Theorem 24. For $d \in \mathbb{N}$, let $F_{\mathrm{lin}(d)}$ denote the class of linear ranking functi...

28 | Confidence intervals for the area under the ROC curve
- Cortes, Mohri
- 2005
Citation Context ...e scores $f(X^-)$ assigned to negative instances $X^-$ both follow negative exponential distributions. Distribution-independent bounds can be obtained by using the fact that the variance $\sigma_A^2$ is at most (Cortes and Mohri, 2005; Dantzig, 1951; Birnbaum and Klose, 1957) $\sigma_{\max}^2 = \frac{A(f)(1 - A(f))}{\min(m, n)} \le \frac{1}{4 \min(m, n)}$. (14) A comparison of the resulting bounds with the large deviation bound we have derived above using McDiarmid’s inequality is shown in Figure 1. ...
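One way to see how the variance bound in Eq. (14) yields a distribution-independent interval is to plug it into Chebyshev's inequality: $\mathbf{P}\{|\hat{A} - A| \ge \epsilon\} \le \sigma_{\max}^2 / \epsilon^2 \le 1 / (4 \min(m, n)\, \epsilon^2)$, and setting the right-hand side to a confidence parameter $\delta$ gives $\epsilon = \sqrt{1 / (4 \min(m, n)\, \delta)}$. The sketch below computes this half-width; pairing Eq. (14) with Chebyshev's inequality in this way, and the function names, are assumptions for illustration rather than the exact construction compared in the paper's Figure 1.

```python
from math import sqrt


def variance_upper_bound(m, n, auc=None):
    """Upper bound on the variance of the empirical AUC, following Eq. (14).

    With `auc` given, returns A(1 - A) / min(m, n); otherwise the
    worst case 1 / (4 min(m, n)), attained at A = 1/2.
    """
    k = min(m, n)
    if auc is None:
        return 1.0 / (4 * k)
    return auc * (1.0 - auc) / k


def chebyshev_half_width(m, n, delta):
    """Half-width eps such that P{|A_hat - A| >= eps} <= delta (assumed pairing)."""
    return sqrt(variance_upper_bound(m, n) / delta)


if __name__ == "__main__":
    # m positives, n negatives, confidence parameter delta
    print(chebyshev_half_width(m=1000, n=4000, delta=0.05))
```

The half-width shrinks like $1/\sqrt{\min(m, n)}$, so the smaller class determines how tight such an interval can be.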

28 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, Lugosi - 1996

17 | Model selection via the AUC
- Rosset
- 2004
Citation Context ...uch that the resulting curve is monotonically increasing. It is the area under the ROC curve (AUC) that has been used as an indicator of the quality of the ranking function $f$ (Cortes and Mohri, 2004; Rosset, 2004). An AUC value of one corresponds to a perfect ranking on the given data sequence (i.e., all positive instances in $T$ are ranked higher than all negative instances); a value of zero corresponds to the...

13 | Learning in neural networks: Theoretical foundations
- Anthony, Bartlett
- 1999
Citation Context ...$= 4 \cdot r\bigl(F, 2\rho(y)M, 2(1 - \rho(y))M\bigr) \cdot e^{-\rho(y)(1 - \rho(y)) M \epsilon^2 / 8}$, ...ew of $y$ defined in Eq. (6). The proof is adapted from proofs of uniform convergence for the classification error rate (see, for example, Anthony and Bartlett, 1999; Devroye et al., 1996). The main difference is that since the AUC cannot be expressed as a sum of independent random variables, more powerful inequalities are required. In particular, a result of Dev...

8 | A uniform convergence bound for the area under the ROC curve - Agarwal, Har-Peled, et al. - 2005

5 | On the consistency and the power of Wilcoxon's two sample test - Dantzig - 1951

5 | PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification
- Graepel, Herbrich, et al.
- 2005
Citation Context ...for the AUC can be derived using different proof techniques. Possible routes for deriving alternative bounds for the AUC could include the theory of compression bounds (Littlestone and Warmuth, 1986; Graepel et al., 2005). Acknowledgments. We would like to thank the anonymous reviewers for many useful suggestions and for pointing us to the statistical literature on ra...

4 | A large deviation bound for the area under the ROC curve - Agarwal, Graepel, et al. - 2005

4 | Bounds for the variance of the Mann-Whitney statistic
- Birnbaum, Klose
- 1957
Citation Context ...nstances $X^-$ both follow negative exponential distributions. Distribution-independent bounds can be obtained by using the fact that the variance $\sigma_A^2$ is at most (Cortes and Mohri, 2005; Dantzig, 1951; Birnbaum and Klose, 1957) $\sigma_{\max}^2 = \frac{A(f)(1 - A(f))}{\min(m, n)} \le \frac{1}{4 \min(m, n)}$. (14) A comparison of the resulting bounds with the large deviation bound we have derived above using McDiarmid’s inequality is shown in Figure 1. ...

3 | Average precision and the problem of generalisation
- Hill, T, et al.
- 2002
Citation Context ...xpected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. Our conceptual approach in deriving the large deviation result for the AUC is similar to that of Hill et al. (2002), in which large deviation properties of the average precision were considered. Section 4 contains our uniform convergence result, which serves to bound the expected accuracy of a learned ranking fun...

3 | Decoupling: From Dependence to Independence - de la Peña, Giné - 1999

1 | Exponential inequalities in nonparametric estimation - Devroye - 1991