Stability and Generalization
, 2001
Cited by 124 (6 self)
We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use can be applied in the regression framework as well as in the classification one when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms such as regularization based algorithms. In particular we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVM for regression and classification.
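The two quantities the bounds above are based on, the empirical error and the leave-one-out error, can be sketched for a toy regularized algorithm. This is only an illustration under stated assumptions: the one-dimensional ridge regression without intercept, the squared loss, and the names `fit_ridge_1d`, `empirical_error` and `leave_one_out_error` are hypothetical choices, not taken from the paper.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimize (1/n) * sum((w*x - y)^2) + lam * w^2 over a scalar weight w.

    Assumed toy model: 1-D ridge regression with no intercept.
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam * n)


def empirical_error(xs, ys, lam):
    """Average squared loss on the same sample the weight was fitted on."""
    w = fit_ridge_1d(xs, ys, lam)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def leave_one_out_error(xs, ys, lam):
    """Average squared loss on each point when it is held out of the fit."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Refit on the sample with the i-th point removed.
        w = fit_ridge_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        total += (w * xs[i] - ys[i]) ** 2
    return total / n


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 7.8]
    print(empirical_error(xs, ys, 0.1), leave_one_out_error(xs, ys, 0.1))
```

Removing one point changes the fitted weight only slightly when `lam` is large, which is the informal idea behind stability for regularization-based algorithms.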
On The Strong Universal Consistency Of Nearest Neighbor Regression Function Estimates
- Annals of Statistics
, 1994
Cited by 28 (4 self)
Devroye, Györfi, Krzyżak and Lugosi
this paper is the proof that condition (d) implies (a). We begin with an exponential inequality generalizing inequalities due to Hoeffding (1963). The generalization due to Azuma (1967) [see Stout (1974)] has led to interesting applications in combinatorics and the theory of random graphs [for a survey, see McDiarmid (1989)]. We have used it in density estimation [Devroye (1988, 1991)]
Convergence Properties of Functional Estimates for Discrete Distributions
, 2001
Cited by 24 (2 self)
Suppose P is an arbitrary discrete distribution on a countable alphabet 𝒳. Given an i.i.d. sample (X_1,...,X_n) drawn from P, we consider the problem of estimating the entropy H(P) or some other functional F=F(P) of the unknown distribution P. We show that, for additive functionals satisfying mild conditions (including the cases of the mean, the entropy, and mutual information), the plug-in estimates of F are universally consistent. We also prove that, without further assumptions, no rate-of-convergence results can be obtained for any sequence of estimators. In the case of entropy estimation, under a variety of different assumptions, we get rate-of-convergence results for the plug-in estimate and for a nonparametric estimator based on match-lengths. The behavior of the variance and the expected error of the plug-in estimate is shown to be in sharp contrast to the finite-alphabet case. A number of other important examples of functionals are also treated in some detail.
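A minimal sketch of a plug-in entropy estimate, assuming the usual construction: replace P by the empirical distribution of the sample and evaluate the entropy of that. The function name `plugin_entropy` and the choice of natural logarithms (entropy in nats) are assumptions made for illustration.

```python
import math
from collections import Counter


def plugin_entropy(sample):
    """Entropy (in nats) of the empirical distribution of `sample`.

    Assumed form: H(P_n) = -sum_x P_n(x) * log P_n(x), where P_n(x) is the
    fraction of sample points equal to x.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


if __name__ == "__main__":
    # Symbols can be any hashable values from a countable alphabet.
    print(plugin_entropy(["a", "b", "a", "c", "a", "b"]))
```

Only symbols that actually occur in the sample contribute, so the estimate never exceeds log n, whatever the true entropy is.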
Generalization bounds for the area under the ROC curve
- Journal of Machine Learning Research
Cited by 23 (5 self)
We study generalization properties of the area under an ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for bipartite ranking problems. The AUC is a different and more complex term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define a precise notion of the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard shatter coefficients (also known variously as the counting numbers or growth function) in uniform convergence results for the classification error rate. We also compare our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC; as we show, the bound provided by our result is considerably tighter.
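The empirical AUC of a ranking function on a finite data sequence can be sketched as follows. The sketch assumes the common Wilcoxon-Mann-Whitney form, the fraction of positive-negative pairs ranked correctly with ties counted as one half; the name `empirical_auc` and the two-list input layout are hypothetical.

```python
def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs that the scores rank correctly.

    Assumed convention: a tie between a positive and a negative score
    counts as half a correctly ranked pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    total = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Scores assigned by some ranking function to positive and negative examples.
    print(empirical_auc([0.9, 0.7, 0.4], [0.5, 0.2]))
```

The quadratic loop keeps the definition visible; a sort-based computation would be preferable for long sequences.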
Concentration inequalities
- Advanced Lectures in Machine Learning
, 2004
Cited by 20 (1 self)
Concentration inequalities deal with deviations of functions of independent random variables from their expectation. In the last decade new tools have been introduced making it possible to establish simple and powerful inequalities. These inequalities are at the heart of the mathematical analysis of various problems in machine learning and made it possible to derive new efficient algorithms. This text attempts to summarize some of the basic tools.
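As one classical instance of such an inequality, Hoeffding's inequality bounds the upper tail of a sample mean of n independent variables taking values in [a, b] by exp(-2nt²/(b-a)²). The sketch below compares that bound with a simulated tail frequency for fair coin flips. The inequality is chosen here purely as an illustration, and the function names, the seed and the trial count are assumptions.

```python
import math
import random


def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Hoeffding upper bound on P(sample mean - expectation >= t)."""
    if n <= 0 or t < 0 or b <= a:
        raise ValueError("need n > 0, t >= 0 and a < b")
    return math.exp(-2.0 * n * t * t / (b - a) ** 2)


def simulated_tail(n, t, trials=2000, seed=0):
    """Observed frequency of (mean of n fair coin flips) - 0.5 >= t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        # Small tolerance guards against floating-point rounding at the boundary.
        if mean - 0.5 >= t - 1e-12:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n, t = 100, 0.1
    print("simulated:", simulated_tail(n, t), "bound:", hoeffding_bound(n, t))
```

The bound is distribution-free, so the simulated frequency is typically well below it.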
Universal smoothing factor selection in density estimation: theory and practice (with discussion)
- Test
, 1997
Cited by 19 (10 self)
In earlier work with Gabor Lugosi, we introduced a method to select a smoothing factor for kernel density estimation such that, for all densities in all dimensions, the L1 error of the corresponding kernel estimate is not larger than 3+ε times the error of the estimate with the optimal smoothing factor plus a constant times √(log n / n), where n is the sample size, and the constant only depends on the complexity of the kernel used in the estimate. The result is nonasymptotic, that is, the bound is valid for each n. The estimate uses ideas from the minimum distance estimation work of Yatracos. We present a practical implementation of this estimate, report on some comparative results, and highlight some key properties of the new method.
Pattern classification and learning theory
Cited by 16 (7 self)
1.1 A binary classification problem
Pattern recognition (or classification or discrimination) is about guessing or predicting the unknown class of an observation. An observation is a collection of numerical measurements, represented by a d-dimensional vector x. The unknown nature of the observation is called a class. It is denoted by y and takes values in the set {0, 1}. (For simplicity, we restrict our attention to binary classification.) In pattern recognition, one creates a function g(x): R^d → {0, 1} which represents one's guess of y given x. The mapping g is called a classifier. A classifier errs on x if g(x) ≠ y. To model the learning problem, we introduce a probabilistic setting, and let (X, Y) be an R^d × {0, 1}-valued random pair. The random pair (X, Y) may be described in a variety of ways: for example, it is defined by the pair (μ, η), where μ is the probability measure for X and η is the regression of Y on X. More precisely, for a Borel-measurable set A ⊆ R^d
Large Deviations of Divergence Measures on Partitions
, 2000
Cited by 11 (3 self)
We discuss Chernoff-type large deviation results for the total variation, the I-divergence errors, and the χ²-divergence errors on partitions. In contrast to the total variation and the I-divergence, the χ²-divergence has an unconventional large deviation rate. Applications to Bahadur efficiencies of goodness-of-fit tests based on these divergence measures for multivariate observations are given.

