Results 1–10 of 46
Stability and Generalization
, 2001
Cited by 260 (8 self)
We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use can be applied in the regression framework as well as in the classification one when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms, such as regularization-based algorithms. In particular, we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVM for regression and classification.
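A representative bound of this type, stated from memory and therefore only as a sketch of the general shape: if an algorithm is β-uniformly stable (replacing one training point changes the loss at any test point by at most β) and the loss is bounded by M, then the empirical-error bound takes roughly the form

```latex
% Uniform stability: for all samples S, indices i, and points z,
%   | \ell(A_S, z) - \ell(A_{S^{\setminus i}}, z) | \le \beta .
% Then, with probability at least 1 - \delta over an m-sample S:
R(A_S) \;\le\; \hat{R}_{\mathrm{emp}}(A_S) \;+\; 2\beta
        \;+\; \bigl(4m\beta + M\bigr)\sqrt{\frac{\ln(1/\delta)}{2m}} .
```

For Hilbert-space regularization (e.g. SVMs), β typically scales as O(1/(λm)), which is what makes such bounds non-trivial.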
A Generalized Representer Theorem
 In Proceedings of the Annual Conference on Computational Learning Theory
, 2001
Cited by 221 (18 self)
Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite-dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.
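The generalized statement can be sketched as follows (notation assumed here, not quoted from the paper): for an RKHS H with kernel k, an arbitrary empirical risk term c, and a strictly increasing function Ω of the RKHS norm,

```latex
f^{\star} \;\in\; \arg\min_{f \in \mathcal{H}}\;
   c\bigl((x_1, y_1, f(x_1)), \dots, (x_m, y_m, f(x_m))\bigr)
   + \Omega\bigl(\lVert f \rVert_{\mathcal{H}}\bigr)
\quad\Longrightarrow\quad
f^{\star}(\cdot) \;=\; \sum_{i=1}^{m} \alpha_i \, k(x_i, \cdot) ,
```

so optimization over the (possibly infinite-dimensional) space H reduces to a search over the m coefficients α_i.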
Regularization and semi-supervised learning on large graphs
 In COLT
, 2004
Cited by 147 (1 self)
We consider the problem of labeling a partially labeled graph. This setting may arise in a number of situations, from survey sampling to information retrieval to pattern recognition in manifold settings. It is also of potential practical importance when data is abundant but labeling is expensive or requires human assistance. Our approach develops a framework for regularization on such graphs. The algorithms are very simple and involve solving a single, usually sparse, system of linear equations. Using the notion of algorithmic stability, we derive bounds on the generalization error and relate it to structural invariants of the graph. Some experimental results testing the performance of the regularization algorithm and the usefulness of the generalization bound are presented.
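The "single, usually sparse, system of linear equations" can be illustrated with a minimal sketch (the chain topology, node count, and regularization weight below are illustrative assumptions, not the paper's setup): minimizing a squared fit on the labeled nodes plus a Laplacian smoothness penalty f^T L f yields a linear system in the label estimates f.

```python
import numpy as np

# Toy chain graph on 6 nodes with unit-weight edges between neighbors.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W        # combinatorial graph Laplacian

# Only the endpoints are labeled: node 0 -> -1, node 5 -> +1.
y = np.zeros(n)
y[0], y[5] = -1.0, 1.0
S = np.zeros((n, n))
S[0, 0] = S[5, 5] = 1.0               # selector matrix for labeled nodes

lam = 0.1                             # smoothness weight (illustrative)
# Minimize ||S(f - y)||^2 + lam * f^T L f  =>  (S + lam * L) f = S y
f = np.linalg.solve(S + lam * L, S @ y)
```

In practice L is sparse, so a sparse solver (e.g. `scipy.sparse.linalg.spsolve`) would replace `np.linalg.solve`; the labels diffuse smoothly from the labeled endpoints toward the middle of the chain.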
Consistency of support vector machines and other regularized kernel classifiers
, 2002
Almost-Everywhere Algorithmic Stability and Generalization Error
 In UAI-2002: Uncertainty in Artificial Intelligence
, 2002
Cited by 58 (8 self)
We introduce a new notion of algorithmic stability, which we call training stability.
Magnitude-preserving ranking algorithms
, 2007
Cited by 34 (3 self)
This paper studies the learning problem of ranking when one wishes not just to accurately predict pairwise ordering but also to preserve the magnitude of the preferences or the difference between ratings, a problem motivated by its key importance in the design of search engines, movie recommendation, and other similar ranking systems. We describe and analyze several algorithms for this problem and give stability bounds for their generalization error, extending previously known stability results to non-bipartite ranking and magnitude-of-preference-preserving algorithms. We also report the results of experiments comparing these algorithms on several datasets and compare these results with those obtained using an algorithm minimizing the pairwise misranking error and standard regression.
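A minimal sketch of the magnitude-preserving objective under a linear hypothesis (a known reduction, not the paper's kernel-based algorithms; data and ridge weight are illustrative): for f(x) = w·x, the pairwise loss Σ_{i,j} ((f(x_i) − f(x_j)) − (y_i − y_j))² equals 2n‖r − mean(r)‖² for the residual r = Xw − y, so it can be minimized as a centered ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0   # ratings with an arbitrary offset

# Center features and ratings: the pairwise magnitude-preserving loss is
# invariant to constant shifts, so only centered quantities matter.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 1e-6                                  # small ridge term (illustrative)
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
f = X @ w                                   # scores; f_i - f_j tracks y_i - y_j
```

The learned scores reproduce the pairwise rating differences exactly up to the (irrelevant) constant offset.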
Feature selection with ensembles, artificial variables, and redundancy elimination
 JMLR
, 2009
Cited by 24 (3 self)
Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach.
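The "artificial variables" device can be sketched as follows. The importance score below is a simple absolute-correlation stand-in for the paper's tree-ensemble importances, and the data, seed, and threshold rule are illustrative assumptions: a real feature is kept only if it beats every permuted "shadow" copy, which by construction carries no signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((n, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)  # cols 2-5 are noise

def importance(Z, y):
    # Absolute correlation with the target -- a stand-in for ensemble importances.
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Zc.T @ yc) / (np.linalg.norm(Zc, axis=0) * np.linalg.norm(yc))

# Artificial variables: permute each column independently, destroying any
# relationship with y while preserving each feature's marginal distribution.
shadows = rng.permuted(X, axis=0)

threshold = importance(shadows, y).max()     # best score any pure-noise column gets
selected = np.where(importance(X, y) > threshold)[0]
```

Features whose importance cannot beat the best shadow are treated as indistinguishable from noise and dropped.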
Extensions to McDiarmid’s inequality when differences are bounded with high probability
Cited by 24 (2 self)
The method of independent bounded differences (McDiarmid, 1989) gives large-deviation concentration bounds for multivariate functions in terms of the maximum effect that changing one coordinate of the input can have on the output. This method has been widely used in combinatorial applications and in learning theory. In some recent applications to the theory of algorithmic stability (Kutin and Niyogi, 2002), we need to consider the case where changing one coordinate of the input usually leads to a small change in the output, but not always. We prove two extensions to McDiarmid's inequality. The first applies when, for most inputs, any small change leads to a small change in the output. The second applies when, for a randomly selected input and a random one-coordinate change, the change in the output is usually small.
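For reference, the baseline inequality being extended, in its standard form: if f has the bounded-differences property with constants c_i (changing only the i-th coordinate changes f by at most c_i), then for independent X_1, …, X_n,

```latex
\Pr\Bigl[\, \bigl| f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \bigr| \ge t \,\Bigr]
\;\le\; 2 \exp\!\left( \frac{-2 t^2}{\sum_{i=1}^{n} c_i^2} \right).
```

The extensions described above relax the worst-case constants c_i to difference bounds that are only required to hold with high probability over the input or the perturbation.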
On the impact of kernel approximation on learning accuracy
 Conference on Artificial Intelligence and Statistics
, 2010
Cited by 17 (1 self)
Kernel approximation is commonly used to scale kernel-based algorithms to applications containing as many as several million instances. This paper analyzes the effect of such approximations in the kernel matrix on the hypothesis generated by several widely used learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph-Laplacian-based regularization algorithms. These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation in this context and report the results of experiments evaluating the quality of the Nyström low-rank kernel approximation when used with ridge regression.
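The Nyström approximation itself can be sketched in a few lines (the data, kernel width, and landmark count are illustrative assumptions): sample m landmark points, keep the corresponding columns C and landmark block W of the kernel matrix, and use K ≈ C W⁺ Cᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))

def rbf_kernel(A, B, gamma=0.5):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

K = rbf_kernel(X, X)

m = 50                                     # number of landmark columns
idx = rng.choice(len(X), size=m, replace=False)
C = K[:, idx]                              # sampled columns of K
W = K[np.ix_(idx, idx)]                    # landmark-landmark block
K_approx = C @ np.linalg.pinv(W) @ C.T     # Nystrom low-rank approximation

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

Only the m sampled columns of K are ever evaluated, so the kernel-evaluation cost drops from O(n²) to O(nm); when m = n the reconstruction K K⁺ K = K is exact.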
The interaction of stability and weakness in AdaBoost
, 2001
Cited by 14 (4 self)
We provide an analysis of AdaBoost within the framework of algorithmic stability. In particular, we show that AdaBoost is a stability-preserving operation: if the "input" (the weak learner) to AdaBoost is stable, then the "output" (the strong learner) is almost-everywhere stable. Because classifier combination schemes such as AdaBoost have greatest effect when the weak learner is weak, we discuss weakness and its implications. We also show that the notion of almost-everywhere stability is sufficient for good bounds on generalization error. These bounds hold even when the weak learner has infinite VC dimension.
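The operation being analyzed can be sketched concretely; below is a minimal AdaBoost loop with decision stumps as the weak base learner. The dataset and round count are illustrative, not from the paper.

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    # Axis-aligned threshold classifier with labels in {-1, +1}.
    return sign * np.where(X[:, feat] <= thresh, 1.0, -1.0)

def fit_stump(X, y, w):
    # Exhaustively pick the stump with the smallest weighted error.
    best = (np.inf, 0, 0.0, 1.0)
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]):
            for sign in (1.0, -1.0):
                err = np.sum(w * (stump_predict(X, feat, thresh, sign) != y))
                if err < best[0]:
                    best = (err, feat, thresh, sign)
    return best

def adaboost(X, y, rounds=10):
    w = np.full(len(y), 1.0 / len(y))     # uniform initial example weights
    ensemble = []
    for _ in range(rounds):
        err, feat, thresh, sign = fit_stump(X, y, w)
        err = max(err, 1e-12)             # guard against division by zero
        if err >= 0.5:
            break                         # weak learner no better than chance
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = stump_predict(X, feat, thresh, sign)
        w = w * np.exp(-alpha * y * pred) # up-weight misclassified examples
        w = w / w.sum()
        ensemble.append((alpha, feat, thresh, sign))
    return ensemble

def predict(ensemble, X):
    scores = sum(a * stump_predict(X, f, t, s) for a, f, t, s in ensemble)
    return np.sign(scores)

rng = np.random.default_rng(0)
X = rng.uniform(size=(80, 2))
y = np.where(X[:, 0] > 0.5, 1.0, -1.0)    # separable by a single threshold
ensemble = adaboost(X, y, rounds=5)
```

The stability question studied above concerns how much `ensemble`, and hence `predict`, can change when one training example in `X, y` is replaced.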