## Statistical Behavior and Consistency of Classification Methods based on Convex Risk Minimization (2001)

Citations: 113 (6 self)

### BibTeX

```bibtex
@MISC{Zhang01statisticalbehavior,
  author = {Tong Zhang},
  title  = {Statistical Behavior and Consistency of Classification Methods based on Convex Risk Minimization},
  year   = {2001}
}
```

### Abstract

We study how closely the optimal Bayes error rate can be approached by a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. The closeness is measured in terms of the loss function used in the estimation. We show that such a classification scheme can generally be regarded as a (non-maximum-likelihood) conditional in-class probability estimate, and we use this analysis to compare various convex loss functions that have appeared in the literature. Furthermore, the theoretical insight allows us to design good loss functions with desirable properties. Another aspect of our analysis is to demonstrate the consistency of certain classification methods using convex risk minimization.

### Citations

8973 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...n error function I(p, y). For example, AdaBoost [7] employs the exponential loss function exp(−py) [2, 3, 14, 7], and support vector machines (SVMs) employ a loss function of the form max(1 − py, 0) [16]. In general, let φ be a one-variable convex function. We may consider the (approximate) minimization in a function class C with respect to the following empirical risk [In this paper, we only conside...
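The excerpt above names the two classic convex surrogates. As a quick illustration (a minimal sketch, not code from the paper), the following Python snippet checks that both the exponential loss exp(−py) and the hinge loss max(1 − py, 0) upper-bound the 0–1 error 1(py ≤ 0):

```python
import math

def zero_one(p, y):
    # 0-1 classification error: 1 if the margin p*y is nonpositive
    return 1.0 if p * y <= 0 else 0.0

def exp_loss(p, y):
    # AdaBoost's exponential surrogate exp(-p*y)
    return math.exp(-p * y)

def hinge_loss(p, y):
    # the SVM hinge surrogate max(1 - p*y, 0)
    return max(1.0 - p * y, 0.0)

# spot-check the upper-bound property on a few (prediction, label) pairs
for p, y in [(0.7, 1), (-0.3, 1), (2.0, -1), (0.0, 1)]:
    assert exp_loss(p, y) >= zero_one(p, y)
    assert hinge_loss(p, y) >= zero_one(p, y)
```

Because both surrogates dominate the 0–1 error pointwise, driving the empirical surrogate risk down also controls the training classification error, which is what makes the convex relaxation meaningful.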

3265 | Variational Analysis
- Rockafellar, Wets
- 1997
Citation Context: ...definition, φ′(f) in general denotes a subgradient of a convex function φ(f) at f. A subgradient p∗ of a convex function φ(f) at p is a value such that φ(q) ≥ φ(p) + p∗(q − p) for all q (see [11], Section 23). Clearly, by definition, the Bregman divergence is always nonnegative. However, in general, a subgradient of a convex function at a point may not always exist, and even when it exists it...
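The subgradient inequality quoted here is exactly what makes the Bregman divergence nonnegative. A minimal numerical sketch, assuming for illustration the convex loss φ(f) = exp(−f), which is differentiable, so the subgradient is just the derivative:

```python
import math

def bregman(phi, dphi, q, p):
    # Bregman divergence D(q, p) = phi(q) - phi(p) - dphi(p) * (q - p);
    # the subgradient inequality phi(q) >= phi(p) + dphi(p) * (q - p)
    # says exactly that D(q, p) >= 0 for convex phi
    return phi(q) - phi(p) - dphi(p) * (q - p)

phi = lambda f: math.exp(-f)    # convex, differentiable everywhere
dphi = lambda f: -math.exp(-f)  # its derivative (the unique subgradient)

# nonnegativity holds at every pair of points
for p in (-1.0, 0.0, 2.0):
    for q in (-2.0, 0.5, 3.0):
        assert bregman(phi, dphi, q, p) >= 0.0
```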

2308 | A decision-theoretic generalization of online learning and an application to boosting - Freund, Schapire - 1997 |

1367 | Real and Complex Analysis - Rudin - 1974 |

1272 | Spline Models for Observational Data
- Wahba
- 1990
Citation Context: ...x, ·)〉, where we use 〈·, ·〉 to denote the inner product in H. This fact will be used in the next section. For further information on reproducing kernel Hilbert spaces, we refer interested readers to [17]. We now consider kernel functions of the form Kh([x1, b1], [x2, b2]) = h(x1ᵀx2 + b1b2), where h can be expressed as a Taylor expansion with nonnegative coefficients. It is well known that Kh is ...
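The kernels Kh described in this excerpt can be checked numerically: when h has nonnegative Taylor coefficients (here h = exp, chosen purely for illustration), the Gram matrix built from Kh should be positive semidefinite. A small stdlib-only sketch:

```python
import math
import random

def Kh(x1, b1, x2, b2, h=math.exp):
    # kernel of the form h(x1.x2 + b1*b2); h = exp has nonnegative
    # Taylor coefficients, so Kh is a positive semidefinite kernel
    return h(sum(a * b for a, b in zip(x1, x2)) + b1 * b2)

random.seed(0)
# a few random 3-d points with bias coordinate b = 1
pts = [([random.uniform(-1, 1) for _ in range(3)], 1.0) for _ in range(5)]
K = [[Kh(x1, b1, x2, b2) for (x2, b2) in pts] for (x1, b1) in pts]

# spot-check positive semidefiniteness via quadratic forms v^T K v >= 0
for _ in range(20):
    v = [random.uniform(-1, 1) for _ in range(5)]
    quad = sum(v[i] * K[i][j] * v[j] for i in range(5) for j in range(5))
    assert quad >= -1e-9
```

The quadratic-form test vᵀKv ≥ 0 on random vectors is only a spot check, not a proof of positive semidefiniteness; the proof goes through the nonnegative Taylor coefficients of h.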

1219 | Additive logistic regression: a statistical view of boosting
- Friedman, Hastie, et al.
- 2000
Citation Context: ...a number of methods have been proposed to alleviate this computational problem. The basic idea is to minimize a convex upper bound of the classification error function I(p, y). For example, AdaBoost [7] employs the exponential loss function exp(−py) [2, 3, 14, 7], and support vector machines (SVMs) employ a loss function of the form max(1 − py, 0) [16]. In general, let φ be a one-variable convex fun...

721 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, The Annals of Statistics
- Bartlett, Freund, et al.
Citation Context: ... minimize the true classification error. So far, the most influential explanation of their success is the so-called “margin” analysis. This concept has been used to explain both SVM [16] and boosting [13]. The basic idea is that using convex risk minimization one attempts to separate the values of f(x) for in-class and out-of-class data as much as possible. However, in a statistical estimation procedu...

698 | Improved Boosting Algorithms using Confidence-rated Predictions
- Schapire, Singer
- 1999
Citation Context: ... this computational problem. The basic idea is to minimize a convex upper bound of the classification error function I(p, y). For example, AdaBoost [7] employs the exponential loss function exp(−py) [2, 3, 14, 7], and support vector machines (SVMs) employ a loss function of the form max(1 − py, 0) [16]. In general, let φ be a one-variable convex function. We may consider the (approximate) minimization in a fu...

259 | The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming
- Bregman
- 1967
Citation Context: ...) ≥ 0)], where 1(·) denotes the set indicator function. Ef(x)<0 is similarly defined. For convenience, we also introduce the notation: (7) Q(η, f) = ηφ(f) + (1 − η)φ(−f), where we assume that η ∈ [0, 1]. Let R∗ denote the extended real line (R∗ = R ∪ {−∞, +∞}). We extend a convex function g : R → R to a function g : R∗ → R∗ by defining g(∞) = lim_{x→∞} g(x) and g(−∞) = lim_{x→−∞} g(x). The extension is on...
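The conditional risk Q(η, f) = ηφ(f) + (1 − η)φ(−f) in this excerpt is what ties convex risk minimization to conditional probability estimation: minimizing Q over f recovers a link function of η. For the exponential loss the minimizer is the half log-odds f∗ = ½ ln(η/(1 − η)), a standard fact; the sketch below is illustrative, not code from the paper:

```python
import math

def Q(eta, f, phi):
    # conditional phi-risk Q(eta, f) = eta*phi(f) + (1 - eta)*phi(-f)
    return eta * phi(f) + (1 - eta) * phi(-f)

phi = lambda z: math.exp(-z)  # AdaBoost's exponential loss

eta = 0.8
# analytic minimizer for the exponential loss: half the log-odds of eta
f_star = 0.5 * math.log(eta / (1 - eta))

# brute-force minimization of Q over a fine grid agrees with f_star
grid = [i / 1000.0 for i in range(-3000, 3001)]
f_num = min(grid, key=lambda f: Q(eta, f, phi))
assert abs(f_num - f_star) < 1e-2
```

Inverting the link, η = 1/(1 + exp(−2f∗)), is the sense in which the fitted f behaves as an in-class probability estimate.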

137 | Prediction games and arcing algorithms
- Breiman
- 1999
Citation Context: ... this computational problem. The basic idea is to minimize a convex upper bound of the classification error function I(p, y). For example, AdaBoost [7] employs the exponential loss function exp(−py) [2, 3, 14, 7], and support vector machines (SVMs) employ a loss function of the form max(1 − py, 0) [16]. In general, let φ be a one-variable convex function. We may consider the (approximate) minimization in a fu...

121 | Boosting with the l2-loss: Regression and classification
- Bühlmann, Yu
Citation Context: ...eneral loss functions. This question can be answered using our analysis. In another related work, Bühlmann and Yu investigated various theoretical issues for the least squares formulation of boosting [5] and argued that the procedure can be as effective as other methods. However, unlike this paper, they did not focus on the approximation error aspect. In fact, we will see later that using other loss ...

118 | Multilayer feedforward networks with a nonpolynomial activation function can approximate any function
- Leshno, Lin, et al.
- 1993
Citation Context: ...networks (with sigmoidal activation function h) are universal approximators. In this paper, we use the following general version of a neural network universal approximation result, which was proved in [8]: THEOREM 4.2 ([8]). If h is a nonpolynomial continuous function, then Ch is dense in C(U) for all compact subsets U of Rᵈ. We should mention that in the original theorem, the density result is state...

66 | On the bayes-risk consistency of regularized boosting methods
- Lugosi, Vayatis
- 2004
Citation Context: ...rty of the least squares method. At about the same time as this work, Lugosi and Vayatis studied the consistency issue for certain forms of boosting methods using ideas similar to what we employ here [9]. However, the framework developed in their paper is quite different from this work. In particular, they did not study certain issues discussed here, such as the interpretation of convex risk minimiza...

47 | Real and Complex Analysis, 3rd ed
- Rudin
- 1987
Citation Context: ...in(max(η(x), δ), 1 − δ), |f∗φ(η)|, Mδ = sup_{|z| ≤ Kδ} |φ(z)| + 1. Using the assumptions on f∗φ and φ, we have Kδ, Mδ < +∞. Since µ is regular, using Lusin's theorem in measure theory (e.g., see [12], page 55), we know that f∗φ(ηδ(x)) can be approximated by a continuous function α′(x) ∈ C(U) such that |α′(x)| ≤ Kδ and P(f∗φ(ηδ(x)) ≠ α′(x)) ≤ ε/(2Mδ). This implies that EXQ(η(X), α′(X...

46 | Support vector machines are universally consistent
- Steinwart
Citation Context: ... consistency of boosting-like procedures using results of this paper. For example, see [10]. For the support vector machine formulation, Ingo Steinwart independently obtained universal consistency in [15] using a different approach but without convergence rate results such as those in Section 4. The goal of this paper is to study the impact of a convex loss function φ in an estimation scheme that appr...

29 | Arcing classifiers (with discussion) - Breiman - 1998 |

29 | Some Infinity Theory for Predictor Ensembles
- Breiman
- 2000
Citation Context: ...ave started to investigate issues related to what we are interested in here. To our knowledge, Breiman is the first person to consider the consistency issue for boosting-type algorithms. He showed in [4] that in the infinite-sample case an arcing-style greedy approximation procedure using the AdaBoost exponential loss function converges to the Bayes classifier...

18 | The consistency of greedy algorithms for classification
- Mannor, Meir, et al.
- 2002
Citation Context: ...tional probability estimate and the associated analysis of loss functions. It is also possible to demonstrate the consistency of boosting-like procedures using results of this paper. For example, see [10]. For the support vector machine formulation, Ingo Steinwart independently obtained universal consistency in [15] using a different approach but without convergence rate results such as those in Secti...

8 | A leave-one-out cross validation bound for kernel methods with applications in learning
- ZHANG
- 2001
Citation Context: ...o the minimization of (4). Although (15) is formulated as an infinite-dimensional optimization problem, it is well known that the computation can be performed in a finite-dimensional space (e.g., see [17, 18]). Let f⊥ be the orthogonal projection of f onto the subspace VX spanned by gi(x) = h(Xᵢᵀx + 1) ∈ Hh̄ (i = 1, . . . , n). Then by definition f(Xi) − f⊥(Xi) = 〈(f − f⊥), gi〉 = 0. Since ∀ f ∉ ...
