Results 1–10 of 10
Convexity, Classification, and Risk Bounds
 Journal of the American Statistical Association
, 2003
Abstract
Cited by 119 (12 self)
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we ...
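The surrogate-loss trade-off this abstract describes is easy to picture: the hinge loss (SVM) and the logistic loss (boosting/logistic regression) are convex functions of the margin y·f(x) that dominate the 0-1 loss pointwise. A minimal sketch (function names are ours, not the paper's; the logistic loss is taken base-2 so it upper-bounds the 0-1 loss at margin 0):

```python
import numpy as np

def zero_one(margin):
    # 0-1 loss as a function of the margin y*f(x)
    return (margin <= 0).astype(float)

def hinge(margin):
    # SVM surrogate: convex upper bound on the 0-1 loss
    return np.maximum(0.0, 1.0 - margin)

def logistic(margin):
    # boosting/logistic-regression surrogate, base-2 so logistic(0) = 1
    return np.log2(1.0 + np.exp(-margin))

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# every surrogate dominates the 0-1 loss pointwise
assert np.all(hinge(margins) >= zero_one(margins))
assert np.all(logistic(margins) >= zero_one(margins))
```

Minimizing either convex surrogate is tractable, while the paper's ψ-transform quantifies how much 0-1 excess risk this substitution can cost.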
Diffusion Kernels on Statistical Manifolds
, 2004
Abstract
Cited by 87 (6 self)
A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the use of Gaussian or linear kernels, which have been the standard for text classification.
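For the multinomial case, a leading-order form of the kernel can be written down directly: the Fisher geodesic distance between two points on the simplex is d(θ, θ′) = 2·arccos(Σᵢ√(θᵢθᵢ′)), and the heat kernel is approximated by a Gaussian in that distance. The sketch below is only this leading-order approximation, dropping the curvature/volume corrections of the full parametrix expansion:

```python
import numpy as np

def multinomial_diffusion_kernel(theta, theta_prime, t=1.0):
    """Leading-order heat-kernel approximation on the multinomial simplex.

    Sketch only: uses the Fisher geodesic distance
    d = 2 * arccos(sum_i sqrt(theta_i * theta_prime_i)) and the
    Gaussian-like factor exp(-d^2 / (4 t)); curvature corrections omitted.
    """
    inner = np.clip(np.sum(np.sqrt(theta * theta_prime)), -1.0, 1.0)
    d = 2.0 * np.arccos(inner)
    return np.exp(-d ** 2 / (4.0 * t))

doc_a = np.array([0.5, 0.3, 0.2])   # term distributions on the simplex
doc_b = np.array([0.2, 0.3, 0.5])
assert multinomial_diffusion_kernel(doc_a, doc_a) == 1.0
assert 0.0 < multinomial_diffusion_kernel(doc_a, doc_b) < 1.0
```

Applied to normalized term frequencies, this plays the role of the Gaussian kernel on documents, with t acting as the diffusion-time bandwidth.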
Consistency and convergence rates of one-class SVM and related algorithms
, 2006
Abstract
Cited by 29 (3 self)
We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error, which is shown to converge to the Bayes risk, thereby proving the Bayes consistency of a variety of methods even though the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization cannot vanish by construction, and which is shown for the first time to be a consistent density level set estimator.
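The density level set estimation target mentioned at the end can be illustrated with a simple plug-in estimator. The sketch below uses a 1-D kernel density estimate rather than the one-class SVM itself (bandwidth and level are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def kde(points, x, h):
    # Gaussian kernel density estimate at x with bandwidth h (1-D sketch)
    return np.mean(np.exp(-((x - points) ** 2) / (2 * h ** 2))) / (h * np.sqrt(2 * np.pi))

# sample from a standard normal; estimate the level set {x : density >= level}
sample = rng.normal(0.0, 1.0, size=2000)
grid = np.linspace(-4, 4, 81)
dens = np.array([kde(sample, x, h=0.3) for x in grid])
level = 0.1
level_set = grid[dens >= level]
# the estimated level set should be an interval straddling the mode at 0
assert level_set.min() < 0 < level_set.max()
```

A consistent one-class SVM recovers the same object, {x : p(x) ≥ level}, without ever estimating the density on a grid.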
Classification with reject option
 Canad. J. Statist
, 2006
Abstract
Cited by 12 (2 self)
This paper studies two-class (or binary) classification of elements X in R^k that allows for a reject option. Based on n independent copies of the pair of random variables (X, Y) with X ∈ R^k and Y ∈ {0, 1}, we consider classifiers f(X) that render three possible outputs: 0, 1 and R. The option R expresses doubt and is to be used for the few observations that are hard to classify in an automatic way. Chow (1970) derived the optimal rule minimizing the risk P{f(X) ≠ Y, f(X) ≠ R} + d·P{f(X) = R}. This risk function assumes that the cost of making a wrong decision equals 1 and that of invoking the reject option is d. We show that the classification problem hinges on the behavior of the regression function η(x) = E(Y | X = x) near d and 1 − d. (Here d ∈ [0, 1/2], as the other cases turn out to be trivial.) Classification rules can be categorized into plug-in estimators and empirical risk minimizers. Both types are considered here, and we prove that the rate of convergence of the risk of any estimate depends on P{|η(X) − d| ≤ δ} + P{|η(X) − (1 − d)| ≤ δ} and on the quality of the estimate of η or an appropriate measure of the size of the class of classifiers, in the case of plug-in rules and empirical risk minimizers, respectively. We extend the mathematical framework even further by differentiating between the costs associated with the two possible errors: predicting f(X) = 0 whilst Y = 1 and predicting f(X) = 1 whilst Y = 0. Such situations are common in, for instance, medical studies where misclassifying a sick patient as healthy is worse than the opposite. Running title: Classification with reject option
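Chow's optimal rule has a simple plug-in form: classify when η(x) lies in the confident tails and reject when it falls in the middle band [d, 1 − d]. A minimal sketch (the function name and output encoding are ours):

```python
import numpy as np

def chow_rule(eta, d):
    """Plug-in version of Chow's (1970) optimal rule with reject cost d.

    Predict 1 if eta(x) > 1 - d, predict 0 if eta(x) < d, otherwise
    output "R" (reject). eta: array of regression values E(Y | X = x).
    """
    out = np.full(eta.shape, "R", dtype=object)
    out[eta > 1 - d] = 1   # confident the label is 1
    out[eta < d] = 0       # confident the label is 0
    return out

eta = np.array([0.05, 0.15, 0.5, 0.85, 0.95])
print(chow_rule(eta, d=0.1))   # confident tails classified, middle band rejected
```

The paper's rate analysis then measures how much probability mass η(X) places near the two decision thresholds d and 1 − d.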
Smooth Sparse Coding via Marginal Regression for Learning Sparse Representations
Abstract
Cited by 2 (0 self)
We propose and analyze a novel framework for learning sparse representations, based on two statistical techniques: kernel smoothing and marginal regression. The proposed approach provides a flexible framework for incorporating feature similarity or temporal information present in data sets, via nonparametric kernel smoothing. We provide generalization bounds for dictionary learning using smooth sparse coding and show how the sample complexity depends on the L1 norm of the kernel function used. Furthermore, we propose using marginal regression for obtaining sparse codes, which significantly improves the speed and allows one to scale to large dictionary sizes easily. We demonstrate the advantages of the proposed approach, both in terms of accuracy and speed, by extensive experimentation on several real data sets. In addition, we demonstrate how the proposed approach can be used for improving semi-supervised sparse coding.
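The speed advantage of marginal regression comes from replacing a lasso solve per signal with a single matrix multiply followed by thresholding. The sketch below is an illustrative k-sparse hard-thresholding variant, not the authors' exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

def marginal_regression_codes(D, X, k):
    """Sparse codes via marginal regression (illustrative sketch).

    Correlate each signal with every dictionary atom in one matrix multiply,
    then keep only the k largest-magnitude responses per signal, instead of
    solving a full lasso problem for each column of X.
    """
    Z = D.T @ X                                    # marginal regression coefficients
    thresh = -np.sort(-np.abs(Z), axis=0)[k - 1]   # per-signal k-th largest |coef|
    Z[np.abs(Z) < thresh] = 0.0                    # hard-threshold to k nonzeros
    return Z

D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)    # unit-norm dictionary atoms
X = rng.normal(size=(20, 5))
Z = marginal_regression_codes(D, X, k=3)
assert np.all((Z != 0).sum(axis=0) == 3)
```

The cost per signal is one dictionary-sized inner-product pass, which is what makes large dictionary sizes practical.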
Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Abstract
Cited by 1 (0 self)
In this paper, we give a new generalization error bound for Multiple Kernel Learning (MKL) for a general class of regularizations. Our main target in this paper is dense-type regularizations, including ℓp-MKL, which imposes ℓp mixed-norm regularization instead of ℓ1 mixed-norm regularization. According to recent numerical experiments, sparse regularization does not necessarily show good performance compared with dense-type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a byproduct of our general result, we show a fast learning rate for ℓp-MKL that is the tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of the candidate reproducing kernel Hilbert spaces are inhomogeneous, dense-type regularization achieves a better learning rate than sparse ℓ1 regularization.
Selective Rademacher Penalization and Reduced Error Pruning of Decision Trees
Abstract
Cited by 1 (0 self)
Rademacher penalization is a modern technique for obtaining data-dependent bounds on the generalization error of classifiers. It appears to be limited to relatively simple hypothesis classes because of computational complexity issues. In this paper we nevertheless apply Rademacher penalization to the practically important hypothesis class of unrestricted decision trees, by considering the prunings of a given decision tree rather than the tree-growing phase. This study constitutes the first application of Rademacher penalization to hypothesis classes that have practical significance. We present two variations of the approach, one in which the hypothesis class consists of all prunings of the initial tree and another in which only the prunings that are accurate on growing data are taken into account. Moreover, we generalize the error-bounding approach from binary classification to multi-class situations. Our empirical experiments indicate that the proposed new bounds outperform distribution-independent bounds for decision tree prunings and provide nontrivial error estimates on real-world data sets.
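For a finite hypothesis class, such as the set of prunings of a fixed tree, the empirical Rademacher complexity can be estimated directly by Monte Carlo: draw random signs and take the supremum of the signed correlation over the class. A small sketch under the assumption that each pruning is represented by its ±1 prediction vector on the data:

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_rademacher(predictions, n_rounds=200):
    """Monte-Carlo estimate of the empirical Rademacher complexity of a
    finite hypothesis class (e.g. all prunings of a fixed decision tree).

    predictions: (n_hypotheses, n_samples) matrix of +/-1 predictions.
    """
    n = predictions.shape[1]
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)    # random Rademacher signs
        total += np.max(predictions @ sigma) / n   # sup over the class
    return total / n_rounds

# a toy class: 8 hypotheses predicting on 100 points
H = rng.choice([-1.0, 1.0], size=(8, 100))
penalty = empirical_rademacher(H)
assert 0.0 < penalty < 1.0   # small but positive for a finite class
```

The resulting penalty is what gets added to the training error of each pruning to obtain a data-dependent generalization bound.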
Combining PAC-Bayesian and Generic Chaining Bounds
Jean-Yves Audibert, CERTIS, École Nationale des Ponts et Chaussées
Abstract
There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds and, second, to combine their different improvements into a single bound. In particular, we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher-based bounds.
Fast rates for Noisy Clustering
, 2012
Abstract
The effect of errors in variables in empirical minimization is investigated. Given a loss l and a set of decision rules G, we prove a general upper bound for empirical minimization based on a deconvolution kernel and a noisy sample Zi = Xi + εi, i = 1, ..., n. We apply this general upper bound to give the rate of convergence of the expected excess risk in noisy clustering. A recent bound from Levrard (2012) proves that this rate is O(1/n) in the direct case, under Pollard's regularity assumptions. Here the effect of noisy measurements gives a rate of the form O(1/n^(γ/(γ+2β))), where γ is the Hölder regularity of the density of X and β is the degree of ill-posedness.
Improved Bounds for the Nyström Method with Application to Kernel Classification
Abstract
We develop two approaches for analyzing the approximation error bound of the Nyström method, which approximates a positive semidefinite (PSD) matrix by sampling a small set of columns: one based on a concentration inequality for integral operators, and one based on random matrix theory. We show that the approximation error, measured in the spectral norm, can be improved from O(N/√m) to O(N/m^(1−ρ)) in the case of a large eigengap, where N is the total number of data points, m is the number of sampled data points, and ρ ∈ (0, 1/2) is a positive constant that characterizes the eigengap. When the eigenvalues of the kernel matrix follow a p-power law, our analysis based on random matrix theory further improves the bound to O(N/m^(p−1)) under an incoherence assumption. We present a kernel classification approach based on the Nyström method and derive its generalization performance using the improved bound. We show that when the eigenvalues of the kernel matrix follow a p-power law, we can reduce the number of support vectors to N^(2p/(p²−1)), which is sublinear in N when p > 1 + √2, without seriously sacrificing generalization performance. Index Terms—Nyström method, approximation error, concentration inequality, kernel methods, random matrix theory