How to compare different loss functions and their risks, 2006
Cited by 10 (2 self)
Many learning problems are described by a risk functional which in turn is defined by a loss function, and a straightforward and widely known approach to solving such problems is to minimize a (modified) empirical version of this risk functional. However, in many cases this approach suffers from substantial problems such as computational requirements in classification or robustness concerns in regression. To resolve these issues, many successful learning algorithms instead try to minimize a (modified) empirical risk of a surrogate loss function. Of course, such a surrogate loss must be “reasonably related” to the original loss function, since otherwise this approach cannot work well. For classification, good surrogate loss functions have recently been identified, and the relationship between the excess classification risk and the excess risk of these surrogate loss functions has been exactly described. However, beyond the classification problem little is known so far about good surrogate loss functions. In this work we establish a general theory that provides powerful tools for comparing excess risks of different loss functions. We then apply this theory to several learning problems including (cost-sensitive) classification, regression, density estimation, and density level detection.
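As a rough illustration of the surrogate-loss idea in this abstract, the sketch below compares the empirical 0-1 risk with the empirical risks of two standard convex surrogates (hinge and logistic) on a toy sample. It is not taken from the paper: the function names, the convention margin = y * f(x) with labels in {-1, +1}, the base-2 scaling of the logistic loss, and the data are all assumptions chosen for illustration.

```python
import math

def zero_one_loss(margin):
    # margin = y * f(x); a non-positive margin is counted as an error (assumed convention)
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin):
    # convex surrogate used by SVM-type methods
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    # base-2 logarithm so that the loss equals 1 at margin 0 (assumed scaling)
    return math.log2(1.0 + math.exp(-margin))

def empirical_risk(loss, scores, labels):
    # average loss of the real-valued scores f(x_i) against labels y_i in {-1, +1}
    return sum(loss(y * s) for s, y in zip(scores, labels)) / len(labels)

# toy data, purely illustrative
scores = [2.0, 0.3, -0.5, -1.5]
labels = [1, 1, 1, -1]

risks = {
    "zero_one": empirical_risk(zero_one_loss, scores, labels),
    "hinge": empirical_risk(hinge_loss, scores, labels),
    "logistic": empirical_risk(logistic_loss, scores, labels),
}
print(risks)
```

Both surrogates upper-bound the 0-1 loss pointwise under these conventions, so their empirical risks are never smaller than the empirical classification error; how the excess risks relate is the kind of question the paper addresses.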
On the Consistency of Ranking Algorithms
Cited by 5 (0 self)
We present a theoretical analysis of supervised ranking, providing necessary and sufficient conditions for the asymptotic consistency of algorithms based on minimizing a surrogate loss function. We show that many commonly used surrogate losses are inconsistent; surprisingly, we show inconsistency even in low-noise settings. We present a new value-regularized linear loss, establish its consistency under reasonable assumptions on noise, and show that it outperforms conventional ranking losses in a collaborative filtering experiment. The goal in ranking is to order a set of inputs in accordance with the preferences of an individual or a population. In this paper we consider a general formulation of the supervised ranking problem in which each training example consists of a query q, a set of inputs x, sometimes called results, and a weighted graph G representing preferences over the results. The learning task is to discover a function that provides a query-specific ordering of the inputs that best respects the observed preferences. This query-indexed setting is natural for tasks like web search in which a different ranking is needed for each query. Following existing literature, we assume the existence of a scoring function f(x,q) that gives a score to each result in x; the scores are sorted to produce a ranking (Herbrich et al., 2000; Freund et al., 2003). We assume simply that the observed preference graph G is a directed acyclic graph (DAG). Finally, we cast our work in a decision-theoretic framework in which ranking procedures are evaluated via a loss function L(f(x,q),G).
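The setup described in this abstract (scores sorted into a ranking, then evaluated against a weighted preference DAG) can be sketched as follows. The specific loss used here, the total weight of preference edges that the scores order incorrectly, is an assumption made for illustration and is not claimed to be the loss analysed in the paper; the names `rank_by_score` and `pairwise_disagreement` and the toy data are hypothetical.

```python
def rank_by_score(scores):
    # indices of the results sorted by decreasing score f(x, q)
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def pairwise_disagreement(scores, preference_edges):
    # preference_edges maps (i, j) -> weight, meaning result i is preferred to result j.
    # Assumed loss: total weight of edges whose endpoints are scored in the wrong
    # order; ties are counted as violations.
    return sum(w for (i, j), w in preference_edges.items() if scores[i] <= scores[j])

# toy query with three results and a small weighted preference DAG
scores = [0.9, 0.2, 0.5]
edges = {(0, 1): 2.0, (1, 2): 1.0, (0, 2): 0.5}

ranking = rank_by_score(scores)
loss = pairwise_disagreement(scores, edges)
print(ranking, loss)
```

In this toy case only the edge (1, 2) is violated, since result 2 is scored above result 1.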
Relative novelty detection
- Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR Workshop and Conference Proceedings, 2009
Cited by 5 (0 self)
Novelty detection is an important tool for unsupervised data analysis. It relies on finding regions of low density within which events are then flagged as novel. By design this is dependent on the underlying measure of the space. In this paper we derive a formulation which is able to address this problem by allowing for a reference measure to be given in the form of a sample from an alternative distribution. We show that this optimization problem can be solved efficiently and that it works well in practice.
VC Theory of Large Margin Multi-Category Classifiers
Cited by 4 (2 self)
In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vector-valued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a naïve extension of the results devoted to the computation of dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψ-dimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines.
A framework for kernel-based multi-category classification, 2005
Cited by 3 (1 self)
A geometric framework for understanding multi-category classification is introduced, through which many existing ‘all-together’ algorithms can be understood. The structure allows the derivation of a parsimonious optimisation function, which is a direct extension of the binary
ABC-Boost: Adaptive Base Class Boost for Multi-class Classification
Cited by 3 (2 self)
We propose abc-boost (adaptive base class boost) for multi-class classification and present abc-mart, an implementation of abc-boost, based on the multinomial logit model. The key idea is that, at each boosting iteration, we adaptively and greedily choose a base class. Our experiments on public datasets demonstrate the improvement of abc-mart over the original mart algorithm.
Radius-margin bound on the leave-one-out error of multi-class SVMs
- n° RR-5780, INRIA, 2005, http://www.inria.fr/rrrt/rr-5780.html. Bibliography in notes
Cited by 2 (2 self)
Using a support vector machine (SVM) requires setting the values of two types of hyperparameters: the soft margin parameter C and the parameters of the kernel. To perform this model selection task, the method of choice is cross-validation. Its leave-one-out variant is known to produce an estimator of the generalization error which is almost unbiased. Its major drawback lies in its time requirement. To overcome this difficulty, several upper bounds on the leave-one-out error of the pattern recognition SVM have been derived. Among those bounds, the most popular one is probably the radius-margin bound. In this report, we establish a generalized radius-margin bound dedicated to the multi-class SVM of Lee, Lin and Wahba.
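To make the cost of the leave-one-out procedure mentioned in this abstract concrete, here is a minimal sketch that computes the leave-one-out error exactly by refitting once per held-out example. A 1-nearest-neighbour rule on scalar features stands in for the SVM purely to keep the example self-contained; the helper names and the toy data are assumptions, and the radius-margin bound itself is not implemented here.

```python
def nearest_neighbour_predict(train, x):
    # train: list of (feature, label) pairs with scalar features (stand-in classifier)
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def leave_one_out_error(data, predict):
    # Each example is held out once and predicted from the remaining n - 1 examples,
    # so the learner is effectively refit n times.
    errors = 0
    for i, (x, y) in enumerate(data):
        remaining = data[:i] + data[i + 1:]
        if predict(remaining, x) != y:
            errors += 1
    return errors / len(data)

# toy one-dimensional sample, purely illustrative
data = [(0.0, 0), (0.2, 0), (0.4, 0), (1.0, 1), (1.2, 1), (0.5, 1)]

loo = leave_one_out_error(data, nearest_neighbour_predict)
print(loo)
```

With a learner that is expensive to train, the n refits in the loop are the time requirement that bounds such as the radius-margin bound are meant to avoid.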
Notes on the generalisation performance and Fisher consistency of multicategory classifiers, 2007
Cited by 1 (1 self)
Existing bounds on the generalisation performance of multicategory classifiers are reviewed and considered in the light of the framework of Hill and Doucet (2005). Insights obtained through the use of this framework are used to further refine these bounds. Similarly, insights into the Fisher consistency of multicategory classifiers which can be obtained from the framework are discussed.
Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
Cited by 1 (0 self)
The consistency of classification algorithms plays a central role in statistical learning theory. A consistent algorithm guarantees that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with related previous work, the distinguishing feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function.
Coherence Functions for Multicategory Margin-based Classification Methods
Cited by 1 (0 self)
Margin-based classification methods are typically devised based on a majorization-minimization procedure, which approximately solves an otherwise intractable minimization problem defined with the 0-1 loss. The extension of such methods from the binary classification setting to the more general multicategory setting turns out to be nontrivial. In this paper, our focus is to devise margin-based classification methods that can be seamlessly applied to both settings, with the binary setting simply as a special case. In particular, we propose a new majorization loss function that we call the coherence function, and then devise a new multicategory margin-based boosting algorithm based on the coherence function. Analogous to deterministic annealing, the coherence function is characterized by a temperature factor. It is closely related to the multinomial log-likelihood function and its limit at zero temperature corresponds to a multicategory hinge loss function.

