Algorithmic Stability and Uniform Generalization
"... Abstract One of the central questions in statistical learning theory is to determine the conditions under which agents can learn from experience. This includes the necessary and sufficient conditions for generalization from a given finite training set to new observations. In this paper, we prove th ..."
Abstract
One of the central questions in statistical learning theory is to determine the conditions under which agents can learn from experience. This includes the necessary and sufficient conditions for generalization from a given finite training set to new observations. In this paper, we prove that algorithmic stability in the inference process is equivalent to uniform generalization across all parametric loss functions. We provide various interpretations of this result. For instance, a relationship is proved between stability and data processing, which reveals that algorithmic stability can be improved by post-processing the inferred hypothesis or by augmenting training examples with artificial noise prior to learning. In addition, we establish a relationship between algorithmic stability and the size of the observation space, which provides a formal justification for dimensionality reduction methods. Finally, we connect algorithmic stability to the size of the hypothesis space, which recovers the classical PAC result that the size (complexity) of the hypothesis space should be controlled in order to improve algorithmic stability and generalization.
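As a hedged illustration of the noise-augmentation idea mentioned in the abstract, the sketch below trains a classifier on training examples perturbed with artificial Gaussian noise prior to learning. Everything here (function name, noise scale, choice of learner) is an assumption made for the example, not the paper's method.

```python
# Minimal sketch of noise augmentation prior to learning (illustrative
# assumption, not the paper's algorithm): train on the original data plus
# noisy copies of each example.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_noise_augmentation(X, y, noise_std=0.1, n_copies=5, seed=0):
    """Fit a classifier on X augmented with Gaussian-perturbed copies."""
    rng = np.random.default_rng(seed)
    X_aug = np.vstack([X] + [X + rng.normal(0.0, noise_std, X.shape)
                             for _ in range(n_copies)])
    y_aug = np.tile(y, n_copies + 1)
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```

Intuitively, the noisy copies make the learned hypothesis less sensitive to any single training example, which is exactly the stability property the abstract relates to uniform generalization.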
Université d’Evry Val d’Essonne
"... PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform mod ..."
Abstract
PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and this is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be, either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first, to the best of our knowledge, PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph, which encodes the dependencies within the data, into sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to obtain new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) …
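To make the graph-decomposition idea concrete, here is a hedged sketch (our assumption, not the authors' code) that partitions example indices into independent sets by greedy coloring of the dependency graph; the number of colors used upper-bounds the chromatic number, and hence the fractional chromatic number that appears in the bounds.

```python
# Greedy coloring of a dependency graph (illustrative sketch). Each color
# class is a set of mutually independent examples; the color count is an
# upper bound on the (fractional) chromatic number.
def greedy_independent_sets(n, edges):
    """n examples; edges are (i, j) pairs marking dependent examples."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    color = {}
    for v in range(n):
        taken = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(n + 1) if c not in taken)
    sets = {}
    for v, c in color.items():
        sets.setdefault(c, []).append(v)
    return list(sets.values())

# Example: a chain of dependencies over 4 examples splits into 2 IID sets.
print(greedy_independent_sets(4, [(0, 1), (1, 2), (2, 3)]))  # [[0, 2], [1, 3]]
```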
PAC-Bayesian Analysis of Co-clustering with Extensions to Matrix Tri-factorization, Graph Clustering, Pairwise Clustering, and Graphical Models
- Journal of Machine Learning Research
"... This paper promotes a novel point of view on unsupervised learning. We argue that the goal of unsupervised learning is to facilitate a solution of some higher level task, and that it should be evaluated in terms of its contribution to the solution of this task. We present an example of such an analy ..."
Abstract
This paper promotes a novel point of view on unsupervised learning. We argue that the goal of unsupervised learning is to facilitate the solution of some higher-level task, and that it should be evaluated in terms of its contribution to the solution of this task. We present an example of such an analysis for the case of co-clustering, which is a widely used approach to the analysis of data matrices. This paper identifies two possible high-level tasks in matrix data analysis: discriminative prediction of the missing entries and estimation of the joint probability distribution of row and column variables. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that have not been part of previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance …
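As a small, hedged illustration of the information-theoretic regularizer mentioned above (an assumption for exposition, not the paper's code): for a hard clustering of row IDs drawn uniformly, the mutual information I(C; R) preserved by the cluster variable reduces to the entropy of the cluster marginal, so coarser clusterings preserve less information about the row IDs.

```python
# Mutual information between a hard cluster variable C and uniform row IDs R:
# I(C; R) = H(C) - H(C|R) = H(C), since the assignment is deterministic.
import numpy as np

def cluster_row_mutual_information(assignment):
    """assignment[r] = cluster of row r; row IDs assumed uniform."""
    _, counts = np.unique(assignment, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(cluster_row_mutual_information([0, 0, 1, 1, 2, 2]))  # ~1.585 bits
print(cluster_row_mutual_information([0, 0, 0, 0, 0, 0]))  # 0.0 bits
```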
Project-Team WILLOW Models of Visual Object Recognition and Scene Understanding
"... c t i v it y e p o r t 2008 Table of contents ..."
A Mathematical Theory of Learning
"... Abstract—In this paper, a mathematical theory of learning is proposed that has many parallels with information theory. We consider Vapnik’s General Setting of Learning in which the learning process is defined to be the act of selecting a hypothesis in response to a given training set. Such hypothesi ..."
Abstract
In this paper, a mathematical theory of learning is proposed that has many parallels with information theory. We consider Vapnik's General Setting of Learning, in which the learning process is defined to be the act of selecting a hypothesis in response to a given training set. Such a hypothesis can, for example, be a decision boundary in classification, a set of centroids in clustering, or a set of frequent item-sets in association rule mining. Depending on the hypothesis space and how the final hypothesis is selected, we show that a learning process can be assigned a numeric score, called learning capacity, which is analogous to Shannon's channel capacity and satisfies similar properties, such as the data-processing inequality and the information-cannot-hurt inequality. In addition, learning capacity provides the tightest possible bound on the difference between the true risk and the empirical risk of the learning process for all loss functions that are parametrized by the chosen hypothesis. It is also shown that the notion of learning capacity equivalently quantifies how sensitive the choice of the final hypothesis is to a small perturbation in the training set. Consequently, algorithmic stability is both necessary and sufficient for generalization. While the theory does not rely on concentration inequalities, we finally show that analogs to classical results in learning theory using the Probably Approximately Correct (PAC) model can be immediately deduced using this theory, and we conclude with information-theoretic bounds on learning capacity.
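The sensitivity notion in the abstract can be probed empirically. The following hedged sketch (our illustration under stated assumptions, not the paper's definition of learning capacity) replaces a single training example and measures how far the learned hypothesis moves in parameter space.

```python
# Crude replace-one sensitivity probe (illustrative assumption): retrain
# after swapping one training example and compare the two hypotheses.
import numpy as np
from sklearn.linear_model import LogisticRegression

def replace_one_sensitivity(X, y, x_new, y_new, seed=0):
    rng = np.random.default_rng(seed)
    base = LogisticRegression(max_iter=1000).fit(X, y)
    i = rng.integers(len(y))            # index of the example to replace
    X2, y2 = X.copy(), y.copy()
    X2[i], y2[i] = x_new, y_new
    perturbed = LogisticRegression(max_iter=1000).fit(X2, y2)
    # Parameter-space distance as a proxy for hypothesis sensitivity
    return float(np.linalg.norm(base.coef_ - perturbed.coef_))
```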
Majorizing codes and measures
"... An information theoretical interpretation of majorizing and minorizing measures is given. The expression logarithmic in the reciprocal of the measure of a ball is replaced by the number of bits needed to achieve desired precision in some convergent code. We also give a local version of the majorizin ..."
Abstract
An information-theoretical interpretation of majorizing and minorizing measures is given. The expression logarithmic in the reciprocal of the measure of a ball is replaced by the number of bits needed to achieve the desired precision in some convergent code. We also give a local version of the majorizing bound.
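For context, the classical majorizing-measure quantity being reinterpreted can be written in its standard Fernique–Talagrand form; this display is our addition for orientation, not taken from the paper.

```latex
% Standard majorizing-measure bound for a centered Gaussian process (X_t)
% indexed by a metric space (T, d), with B(t, eps) the d-ball around t:
\[
  \mathbb{E}\,\sup_{t \in T} X_t
  \;\le\;
  K \inf_{\mu} \sup_{t \in T} \int_0^{\infty}
    \sqrt{\log \frac{1}{\mu\big(B(t,\varepsilon)\big)}}\, d\varepsilon ,
\]
% where mu ranges over probability measures on T. The paper's interpretation
% replaces log(1 / mu(B(t, eps))) by the number of bits of a convergent code
% identifying t to the desired precision eps.
```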