Results 1–10 of 58
Good Practice in Large-Scale Learning for Image Classification
 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (TPAMI)
, 2013
Abstract

Cited by 43 (7 self)
We benchmark several SVM objective functions for large-scale image classification. We consider one-vs-rest, multiclass, ranking, and weighted approximate ranking SVMs. A comparison of online and batch methods for optimizing the objectives shows that online methods perform as well as batch methods in terms of classification accuracy, but with a significant gain in training speed. Using stochastic gradient descent, we can scale the training to millions of images and thousands of classes. Our experimental evaluation shows that ranking-based algorithms do not outperform the one-vs-rest strategy when a large number of training examples are used. Furthermore, the gap in accuracy between the different algorithms shrinks as the dimension of the features increases. We also show that learning through cross-validation the optimal rebalancing of positive and negative examples can result in a significant improvement for the one-vs-rest strategy. Finally, early stopping can be used as an effective regularization strategy when training with online algorithms. Following these “good practices”, we were able to improve the state-of-the-art on a large subset of 10K classes and 9M images of ImageNet from 16.7% Top-1 accuracy to 19.1%.
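The one-vs-rest SGD training this abstract describes can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the Pegasos-style step size, function names, and default hyperparameters are all assumptions:

```python
import numpy as np

def sgd_ovr_hinge(X, y, n_classes, lam=1e-4, epochs=5):
    """Train one linear SVM per class (one-vs-rest) with SGD on the
    regularized hinge loss. Illustrative sketch only."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)  # Pegasos-style step size (assumed)
            for c in range(n_classes):
                yc = 1.0 if y[i] == c else -1.0
                margin = yc * (W[c] @ X[i])
                # subgradient of lam/2 * ||w_c||^2 + max(0, 1 - margin)
                grad = lam * W[c] - (yc * X[i] if margin < 1.0 else 0.0)
                W[c] -= eta * grad
    return W

def predict(W, X):
    """Assign each row of X to the class whose SVM scores highest."""
    return np.argmax(X @ W.T, axis=1)
```

Each class's weight vector only ever sees binary labels (in-class vs. rest), which is what makes the strategy trivially parallel over classes and scalable to thousands of them.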
Statistical analysis of Bayes optimal subset ranking
 IEEE Transactions on Information Theory
, 2008
Abstract

Cited by 37 (0 self)
The ranking problem has become increasingly important in modern applications of statistical methods in automated decision-making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to non-convex optimization problems that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top portion of the rank-list. We further investigate the asymptotic statistical behavior of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.
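The DCG criterion the abstract centers on can be computed as follows; this uses the common exponential-gain, log-discount convention, which may differ in detail from the paper's exact definition:

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulated gain of a ranked list of relevance grades,
    using the common (2^rel - 1) / log2(rank + 1) convention (the paper's
    exact gain/discount choices may differ)."""
    if k is not None:
        relevances = relevances[:k]
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

The discount term is what makes the measure sensitive only to items near the top of the rank-list, which is why the paper's regression bounds focus on estimation quality in that top portion.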
Multi-instance multi-label learning
 Artificial Intelligence
Abstract

Cited by 33 (14 self)
In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework where an example is described by multiple instances and associated with multiple class labels. Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects which have multiple semantic meanings. To learn from MIML examples, we propose the MimlBoost and MimlSvm algorithms based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings in the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we propose the D-MimlSvm algorithm which tackles MIML problems directly in a regularization framework. Moreover, we show that even when we do not have access to the real objects and thus cannot capture more information from real objects by using the MIML representation, MIML is still useful. We propose the InsDif and SubCod algorithms. InsDif works by transforming single instances into the MIML representation for learning, while SubCod works by transforming single-label examples into the MIML representation for learning. Experiments show that in some tasks they are able to achieve better performance than learning the single instances or single-label examples directly.
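The degeneration strategy the abstract mentions can be sketched as a data transform: each MIML example, a bag of instances paired with a label set, yields one multi-instance binary example per label. Function and variable names below are mine, not the paper's:

```python
def degenerate_to_binary_mi(examples, label_space):
    """Degenerate MIML examples into one multi-instance *binary*
    problem per label (a sketch of the first step used by MimlBoost;
    names are assumptions, not the paper's API).

    examples: list of (bag_of_instances, label_set) pairs
    returns: dict mapping label -> list of (bag, label present?) pairs
    """
    return {
        label: [(bag, label in labels) for bag, labels in examples]
        for label in label_space
    }

# Two MIML examples over the label space {"lion", "africa"}:
examples = [
    ([[0.1, 0.2], [0.3, 0.1]], {"lion", "africa"}),
    ([[0.9, 0.8]], {"africa"}),
]
problems = degenerate_to_binary_mi(examples, {"lion", "africa"})
```

Each resulting per-label problem can then be handed to any multi-instance learner; the information loss this incurs is what motivates the direct regularization approach of D-MimlSvm.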
How to compare different loss functions and their risks
, 2006
Abstract

Cited by 21 (2 self)
Many learning problems are described by a risk functional which in turn is defined by a loss function, and a straightforward and widely known approach to learn such problems is to minimize a (modified) empirical version of this risk functional. However, in many cases this approach suffers from substantial problems such as computational requirements in classification or robustness concerns in regression. In order to resolve these issues many successful learning algorithms try to minimize a (modified) empirical risk of a surrogate loss function, instead. Of course, such a surrogate loss must be “reasonably related” to the original loss function since otherwise this approach cannot work well. For classification good surrogate loss functions have been recently identified, and the relationship between the excess classification risk and the excess risk of these surrogate loss functions has been exactly described. However, beyond the classification problem little is known on good surrogate loss functions up to now. In this work we establish a general theory that provides powerful tools for comparing excess risks of different loss functions. We then apply this theory to several learning problems including (cost-sensitive) classification, regression, density estimation, and density level detection.
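For the classification case the abstract alludes to, the standard surrogates can be written down directly as functions of the margin m = y·f(x); a minimal illustration (the choice of these three surrogates is mine, not a list from the paper):

```python
import math

def zero_one(m):
    """The 0-1 loss actually being targeted: 1 iff misclassified."""
    return 0.0 if m > 0 else 1.0

def hinge(m):        # SVM surrogate
    return max(0.0, 1.0 - m)

def logistic(m):     # logistic-regression surrogate
    return math.log(1.0 + math.exp(-m))

def squared(m):      # least-squares surrogate
    return (1.0 - m) ** 2
```

Each convex surrogate upper-bounds the 0-1 loss (for the logistic loss, after rescaling by 1/ln 2), which is one elementary ingredient in relating excess surrogate risk to excess classification risk.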
On the Consistency of Ranking Algorithms
Abstract

Cited by 20 (0 self)
We present a theoretical analysis of supervised ranking, providing necessary and sufficient conditions for the asymptotic consistency of algorithms based on minimizing a surrogate loss function. We show that many commonly used surrogate losses are inconsistent; surprisingly, we show inconsistency even in low-noise settings. We present a new value-regularized linear loss, establish its consistency under reasonable assumptions on noise, and show that it outperforms conventional ranking losses in a collaborative filtering experiment. The goal in ranking is to order a set of inputs in accordance with the preferences of an individual or a population. In this paper we consider a general formulation of the supervised ranking problem in which each training example consists of a query q, a set of inputs x, sometimes called results, and a weighted graph G representing preferences over the results. The learning task is to discover a function that provides a query-specific ordering of the inputs that best respects the observed preferences. This query-indexed setting is natural for tasks like web search in which a different ranking is needed for each query. Following existing literature, we assume the existence of a scoring function f(x, q) that gives a score to each result in x; the scores are sorted to produce a ranking (Herbrich et al., 2000; Freund et al., 2003). We assume simply that the observed preference graph G is a directed acyclic graph (DAG). Finally, we cast our work in a decision-theoretic framework in which ranking procedures are evaluated via a loss function L(f(x, q), G).
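A concrete instance of a surrogate loss L(f(x, q), G) over a preference DAG is a hinge-type penalty on each edge; the choice of hinge surrogate and all names here are mine, for illustration, not the paper's specific loss:

```python
def pairwise_surrogate_loss(scores, preference_edges, margin=1.0):
    """Hinge-type surrogate over a preference DAG: one term per edge
    (better, worse), penalized when the preferred result is not scored
    at least `margin` above the other.

    scores: dict mapping result id -> score f(x, q)
    preference_edges: iterable of (better, worse) DAG edges
    """
    return sum(max(0.0, margin - (scores[better] - scores[worse]))
               for better, worse in preference_edges)
```

Losses of exactly this pairwise family are among the "commonly used surrogate losses" whose consistency the paper examines.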
Relative novelty detection
 Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR Workshop and Conference Proceedings
, 2009
Abstract

Cited by 17 (0 self)
Novelty detection is an important tool for unsupervised data analysis. It relies on finding regions of low density within which events are then flagged as novel. By design this is dependent on the underlying measure of the space. In this paper we derive a formulation which is able to address this problem by allowing for a reference measure to be given in the form of a sample from an alternative distribution. We show that this optimization problem can be solved efficiently and that it works well in practice.
Large-scale image classification with trace-norm regularization
 IEEE Conference on Computer Vision & Pattern Recognition (CVPR)
, 2012
Abstract

Cited by 16 (3 self)
With the advent of larger image classification datasets such as ImageNet, designing scalable and efficient multiclass classification algorithms is now an important challenge. We introduce a new scalable learning algorithm for large-scale multiclass image classification, based on the multinomial logistic loss and the trace-norm regularization penalty. Reframing the challenging non-smooth optimization problem into a surrogate infinite-dimensional optimization problem with a regular ℓ1-regularization penalty, we propose a simple and provably efficient accelerated coordinate descent algorithm. Furthermore, we show how to perform efficient matrix computations in the compressed domain for quantized dense visual features, scaling up to 100,000s of examples, 1,000s-dimensional features, and 100s of categories. Promising experimental results on the “Fungus”, “Ungulate”, and “Vehicles” subsets of ImageNet are presented, where we show that our approach performs significantly better than state-of-the-art approaches for Fisher vectors with 16 Gaussians.
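The standard computational primitive behind trace-norm regularization is singular-value soft-thresholding, sketched below. Note this is the generic proximal step, shown only to illustrate the penalty; the paper itself proposes an accelerated coordinate-descent reformulation rather than this operator:

```python
import numpy as np

def trace_norm_prox(W, tau):
    """Proximal operator of tau * ||W||_* (the trace norm, i.e. the sum
    of singular values), computed by soft-thresholding the singular
    values of W. Generic textbook step, not the paper's algorithm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Shrinking singular values toward zero is what drives the classifier weight matrix toward low rank, the effect the trace-norm penalty is chosen for.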
Composite Multiclass Losses
Abstract

Cited by 15 (6 self)
We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, and existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses cannot be extended to multiclass losses.
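The binary case the paper extends gives the clearest picture of a proper composite loss: the familiar logistic loss is exactly the composition of the (proper) log loss with the sigmoid inverse link. A minimal illustration, using binary rather than multiclass losses:

```python
import math

def sigmoid(v):
    """Inverse link: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def log_loss(p, y):
    """A proper loss on probability estimates p, for label y in {0, 1}."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def logistic_loss(v, y):
    """The familiar margin form of the logistic loss on raw scores."""
    return math.log(1.0 + math.exp(-v if y == 1 else v))
```

The two formulations agree identically: logistic_loss(v, y) == log_loss(sigmoid(v), y). The paper asks when a multiclass loss admits such a proper-loss-plus-link factorization.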
Robust truncated hinge loss support vector machines
 Journal of the American Statistical Association, 102:974–983
, 2007
Abstract

Cited by 11 (1 self)
The support vector machine (SVM) has been widely applied for classification problems in both machine learning and statistics. Despite its popularity, however, SVM has some drawbacks in certain situations. In particular, the SVM classifier can be very sensitive to outliers in the training sample. Moreover, the number of support vectors (SVs) can be very large in many applications. To circumvent these drawbacks, we propose the robust truncated hinge loss SVM (RSVM), which uses a truncated hinge loss. The RSVM is shown to be more robust to outliers and to deliver more accurate classifiers using a smaller set of SVs than the standard SVM. Our theoretical results show that the RSVM is Fisher-consistent, even when there is no dominating class, a scenario that is particularly challenging for multicategory classification. Similar results are obtained for a class of margin-based classifiers.
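The truncated hinge loss itself is a one-line modification of the hinge; a sketch of the idea (the truncation point s = -1 is just an illustrative default, and the notation is mine rather than the paper's):

```python
def hinge(m):
    """Standard hinge loss as a function of the margin m = y*f(x)."""
    return max(0.0, 1.0 - m)

def truncated_hinge(m, s=-1.0):
    """Hinge loss capped at hinge(s): points with margin below the
    truncation point s all incur the same fixed loss, so extreme
    outliers stop pulling on the classifier."""
    return min(hinge(m), hinge(s))
```

Capping the loss bounds each outlier's influence, which is the source of both the robustness and the smaller support-vector set; the price is that the loss becomes non-convex.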
VC Theory of Large Margin Multi-Category Classifiers
Abstract

Cited by 10 (4 self)
In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vector-valued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a naïve extension of the results devoted to the computation of dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψ-dimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multiclass SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines.