## Convexity, Classification, and Risk Bounds (2003)

Venue: Journal of the American Statistical Association

Citations: 119 (12 self)

### BibTeX

@TECHREPORT{Bartlett03convexity,

author = {Peter L. Bartlett and Michael I. Jordan and Jon D. McAuliffe},

title = {Convexity, Classification, and Risk Bounds},

institution = {JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION},

year = {2003}

}

### Abstract

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we
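The surrogate losses described in the abstract can be sketched numerically. The following Python snippet is my own illustration, not code from the paper: it defines the 0-1 loss and four common convex surrogates (hinge, exponential, logistic, truncated quadratic) as functions of the margin α = y·f(x), and checks the pointwise domination that makes the upper bounds on excess risk possible.

```python
import numpy as np

# Losses as functions of the margin alpha = y * f(x).
def zero_one(a):        return (a <= 0).astype(float)    # 0-1 loss
def hinge(a):           return np.maximum(0.0, 1.0 - a)  # support vector machine
def exponential(a):     return np.exp(-a)                # AdaBoost
def logistic(a):        return np.log2(1.0 + np.exp(-a)) # logistic regression
def trunc_quadratic(a): return np.maximum(0.0, 1.0 - a) ** 2

alphas = np.linspace(-2.0, 2.0, 401)
for phi in (hinge, exponential, logistic, trunc_quadratic):
    # Each convex surrogate dominates the 0-1 loss pointwise in the margin.
    assert np.all(phi(alphas) >= zero_one(alphas) - 1e-12), phi.__name__
```

All four surrogates are normalized so that φ(0) = 1, which is why each one sits above the 0-1 loss on all of [−2, 2].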

### Citations

3736 | Convex optimization
- Boyd, Vandenberghe
- 2004
Citation Context ...x programs (Nesterov and Nemirovskii, 1994). Many fields in which optimality principles form the core conceptual structure have been changed significantly by the introduction of these new techniques (Boyd and Vandenberghe, 2004). Convexity arises in many guises in statistics as well, notably in properties associated with the exponential family of distributions (Brown, 1986). It is, however, only in recent years that the sys... |

3296 | Convex Analysis
- ROCKAFELLAR
- 1970
Citation Context ...ativity of ψ is established below in Lemma 5, part 7. Recall that g is convex if and only if epi g is a convex set, and g is closed (epi g is a closed set) if and only if g is lower semicontinuous (Rockafellar, 1997). By Lemma 5, part 5, ψ̃ is continuous, so in fact the closure operation in Definition 2 is vacuous. We therefore have that ψ is simply the functional convex hull of ψ̃, ψ = co ψ̃, which is equiv... |

2335 | A decision-theoretic generalization of on-line learning and an application to boosting
- Freund, Schapire
- 1997
Citation Context ...rning make significant use of convexity; in particular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models (Jordan et al., 1999) are all based directly on ideas from convex optimization. If algorithms from c... |

2202 | Support vector networks
- Cortes, Vapnik
- 1995
Citation Context ...omputational efficiency is an imperative. Many of the most prominent methods studied in machine learning make significant use of convexity; in particular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models... |

2047 | Learning with kernels
- Schölkopf, Smola
- 2002
Citation Context ...ominent methods studied in machine learning make significant use of convexity; in particular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models (Jordan et al., 1999) are all based directly on ideas from con... |

1311 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context ...l models for which computational efficiency is an imperative. Many of the most prominent methods studied in machine learning make significant use of convexity; in particular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational infer... |

1227 | Additive logistic regression: a statistical view of boosting, The Annals of Statistics 38 (2
- Friedman, Hastie, et al.
- 2000
Citation Context ...n particular, Figure 1 shows the (upper-bounding) convex surrogates associated with the support vector machine (Cortes and Vapnik, 1995), Adaboost (Freund and Schapire, 1997) and logistic regression (Friedman et al., 2000). A basic statistical understanding of this setting has begun to emerge. In particular, when... [Figure 1: A plot of the 0-1, exponential, hinge, logistic, and truncated quadratic losses as functions of the margin α] |

1004 | A Probabilistic Theory of Pattern Recognition
- Devroye, Györfi, et al.
- 1996
Citation Context ...it is straightforward to show that R(f) − R* = R(f) − R(η − 1/2) = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|), where 1[Φ] is 1 if the predicate Φ is true and 0 otherwise (see, for example, Devroye et al., 1996). We can apply Jensen's inequality, since ψ is convex by definition, and the fact that ψ(0) = 0 (Lemma 5, part 8) to show that ψ(R(f) − R*) ≤ Eψ(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|) = E... |

838 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1999
Citation Context ...Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models (Jordan et al., 1999) are all based directly on ideas from convex optimization. If algorithms from convex optimization are to continue to make inroads into statistical theory and practice, it is important that we underst... |

727 | Boosting the Margin: a New Explanation for the Effectiveness of Voting Methods
- Schapire, Freund, et al.
Citation Context ...n obtained for function classes with infinite VC-dimension but finite fat-shattering dimension (Bartlett, 1998, Shawe-Taylor et al., 1998), such as the function classes used by AdaBoost (see, e.g., Schapire et al., 1998, Koltchinskii and Panchenko, 2002). These upper bounds provide guidance for model selection and in particular help guide data-dependent choices of regularization parameters. To carry this agenda furt... |

391 | Learning to Classify Text using Support Vector Machines
- Joachims
- 2002
Citation Context ...ly on ideas from convex optimization. These methods have had significant practical successes in applied areas such as bioinformatics, information management and signal processing (Feder et al., 2004, Joachims, 2002, Schölkopf et al., 2003). If algorithms from convex optimization are to continue to make inroads into statistical theory and practice, it is important that we understand these algorithms not only fro... |

344 | Weak Convergence and Empirical Processes - Vaart, Wellner - 2000 |

326 | The concentration of measure phenomenon - Ledoux - 2001 |

253 | Structural risk minimization over data-dependent hierarchies
- Shawe-Taylor, Bartlett, et al.
- 1998
Citation Context ...zer of the empirical average of the 0-1 loss. Indeed a number of such results have been obtained for function classes with infinite VC-dimension but finite fat-shattering dimension (Bartlett, 1998, Shawe-Taylor et al., 1998), such as the function classes used by AdaBoost (see, e.g., Schapire et al., 1998, Koltchinskii and Panchenko, 2002). These upper bounds provide guidance for model selection and in particular help gu... |

177 | The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network
- Bartlett
- 1998
Citation Context ...ble for the minimizer of the empirical average of the 0-1 loss. Indeed a number of such results have been obtained for function classes with infinite VC-dimension but finite fat-shattering dimension (Bartlett, 1998, Shawe-Taylor et al., 1998), such as the function classes used by AdaBoost (see, e.g., Schapire et al., 1998, Koltchinskii and Panchenko, 2002). These upper bounds provide guidance for model select... |

154 | The hardness of approximate optima in lattices, codes, and systems of linear equations
- Arora, Babai, et al.
- 1997
Citation Context ...imizing the sample average of the loss, R̂(f) = (1/n) ∑_{i=1}^n ℓ(Y_i f(X_i)). As is well known, however, such a procedure is computationally intractable for many nontrivial classes of functions (see, e.g., Arora et al., 1997). Indeed, the loss function ℓ(Y f(X)) is non-convex in its (scalar) argument, and, while not a proof, this suggests a source of the difficulty. Moreover, it suggests that we might base a tractable es... |

153 | Optimal aggregation of classifiers in statistical learning - Tsybakov - 2004 |

151 | Fundamentals of Statistical Exponential Families, with Applications in Statistical Decision Theory
- Brown
- 1986
Citation Context ...troduction of these new techniques (Boyd and Vandenberghe, 2003). Convexity arises in many guises in statistics as well, notably in properties associated with the exponential family of distributions (Brown, 1986). It is, however, only in recent years that the systematic exploitation of the algorithmic consequences of convexity has begun in statistics. One applied area in which this trend has been most salien... |

141 | Sharper bounds for Gaussian and empirical processes. Annals of Probability - Talagrand - 1994 |

131 | Uniform Central Limit Theorems
- Dudley
- 1999
Citation Context ...ce (S, d), let N(ε, A, d) denote the cardinality of the smallest ε-cover of A, that is, the smallest set Â ⊂ S for which every a ∈ A has some â ∈ Â with d(a, â) ≤ ε. Using Dudley's entropy integral (Dudley, 1999), Mendelson (2002) has shown the following result: Suppose that F is a set of [−1, 1]-valued functions on X, and there is a γ > 0 and 0 < p < 2 for which sup_P N(ε, F, L2(P)) ≤ γ ε^{−p}, where the s... |

117 | Empirical margin distributions and bounding the generalization error of combined classifiers
- Koltchinskii, Panchenko
Citation Context ... classes with infinite VC-dimension but finite fat-shattering dimension (Bartlett, 1998, Shawe-Taylor et al., 1998), such as the function classes used by AdaBoost (see, e.g., Schapire et al., 1998, Koltchinskii and Panchenko, 2002). These upper bounds provide guidance for model selection and in particular help guide data-dependent choices of regularization parameters. To carry this agenda further, it is necessary to find gener... |

111 | Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32:56–134. MR2051001 - ZHANG - 2004 |

108 | Local Rademacher complexities - Bartlett, Bousquet, et al. |

87 | Some applications of concentration inequalities to statistics - Massart - 2000 |

81 | Boosting and maximum likelihood for exponential models
- Lebanon, Lafferty
- 2002
Citation Context ...cular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models (Jordan et al., 1999) are all based directly on ideas from convex optimization. If algorithms from convex optimization are to continue to make inroads... |

77 | About the constants in Talagrand’s concentration inequalities for empirical processes - Massart |

68 | Efficient agnostic learning of neural networks with bounded fan-in - Lee, Bartlett, et al. - 1996 |

57 | A Bennett concentration inequality and its application to suprema of empirical processes - Bousquet - 2002 |

52 | Consistency of support vector machines and other regularized kernel machines - STEINWART - 2005 |

40 | Rademacher processes and bounding the risk of function learning - Koltchinskii, Panchenko - 2000 |

37 | Process consistency for adaboost - JIANG - 2004 |

36 | Improving the sample complexity using global data - Mendelson - 2002 |

33 | Sequential greedy approximation for certain convex optimization problems
- Zhang
Citation Context ...gosi and Vayatis (2003). Regarding the numerical optimization to determine f̂, Zhang and Yu (2003) give novel bounds on the convergence rate for generic forward stagewise additive modeling (see also Zhang, 2002). These authors focus on optimization of a convex risk functional over the entire linear hull of a base class, with regularization enforced by an early stopping rule. Acknowledgments We would like to... |

30 | A note on margin-based loss functions in classification,” Statist
- Lin
- 2004
Citation Context ...tatistical consequences. Thus, we consider the weakest possible condition on φ: that it is “classification-calibrated,” which is essentially a pointwise form of Fisher consistency for classification (Lin, 2001). In particular, if we define η(x) = P(Y = 1|X = x), then φ is classification-calibrated if, for η(x) ≠ 1/2, the minimizer f* of the conditional expectation E[φ(Y f*(X))|X = x] has the same sign... |

29 | Some infinity theory for predictor ensembles - BREIMAN - 2000 |

26 | Inégalités de concentration pour les processus empiriques de classes de parties, Probability Theory and Related Fields 119 - Rio - 2000 |

24 | Complexity regularization via localized random penalties - Lugosi, Wegkamp - 2004 |

23 | Distance weighted discrimination - Marron, Todd - 2002 |

18 | The consistency of greedy algorithms for classification - MANNOR, MEIR, et al. - 2002 |

13 | On ψ-learning - Shen, Tseng, et al. |

10 | Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, Oct 1999 - Breiman - 1999; Maximum likelihood estimates of monotone parameters - H. D. Brunk - 1955 |

9 | Local rademacher complexities. The Annals of Statistics - Bartlett, Bousquet, et al. - 2005 |

9 | Empirical minimization. Probability Theory and Related Fields - Bartlett, Mendelson |

8 | Une inégalité de concentration gauche pour les processus empiriques - Klein - 2002 |

8 | On the Bayes risk consistency of regularized boosting methods. Annals of Statistics - Lugosi, Vayatis - 2003 |

6 | A Note on Margin-based Loss Functions
- Lin
- 2004
Citation Context ...statistical consequences. Thus, we consider the weakest possible condition on φ: that it is “classification-calibrated,” which is essentially a pointwise form of Fisher consistency for classification (Lin, 2004). In particular, if we define η(x) = P(Y = 1|X = x), then φ is classification-calibrated if, for x such that η(x) ≠ 1/2, every minimizer f* of the conditional expectation E[φ(Y f*(X))|X = x] ha... |
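The classification-calibration condition quoted in this context can be checked numerically: for a surrogate φ, the conditional φ-risk is C_η(α) = η·φ(α) + (1 − η)·φ(−α), and calibration requires that whenever η ≠ 1/2, its minimizer has the same sign as η − 1/2. A minimal sketch of that check, my own illustration using the hinge loss and a grid search (the function names are mine, not from the paper):

```python
import numpy as np

def hinge(a):
    # Hinge loss as a function of the margin.
    return np.maximum(0.0, 1.0 - a)

def conditional_risk_minimizer(phi, eta, grid):
    # C_eta(alpha) = eta * phi(alpha) + (1 - eta) * phi(-alpha);
    # return the grid point minimizing the conditional phi-risk.
    risks = eta * phi(grid) + (1.0 - eta) * phi(-grid)
    return grid[np.argmin(risks)]

grid = np.linspace(-3.0, 3.0, 1201)
for eta in (0.1, 0.3, 0.7, 0.9):
    a_star = conditional_risk_minimizer(hinge, eta, grid)
    # Calibration: the minimizer's sign agrees with sign(eta - 1/2).
    assert np.sign(a_star) == np.sign(eta - 0.5)
```

For the hinge loss the minimizer is ±1 depending on which side of 1/2 the conditional probability η falls, so the sign condition holds for every η ≠ 1/2.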

6 | Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829 - Mammen, Tsybakov - 1999 |

3 | An Introduction to Support Vector Methods
- Cristianini, Shawe-Taylor
- 2000
Citation Context ...s an imperative. Many of the most prominent methods studied in machine learning make significant use of convexity; in particular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Schölkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models (Jordan et al., 1999) are all based... |

3 | Geometric bounds for generalization in boosting - Mannor, Meir - 2001 |

1 | Special issue on machine learning methods in signal processing
- Feder, Figueiredo, et al.
- 2004
Citation Context ...are all based directly on ideas from convex optimization. These methods have had significant practical successes in applied areas such as bioinformatics, information management and signal processing (Feder et al., 2004, Joachims, 2002, Schölkopf et al., 2003). If algorithms from convex optimization are to continue to make inroads into statistical theory and practice, it is important that we understand these algorit... |