Results 1 - 10 of 12
Transductive Rademacher complexity and its applications
- in Proceedings of COLT-07, 20th Annual Conference on Learning Theory, 2007
"... Abstract We develop a technique for deriving data-dependent error bounds for transductive learning algorithms based on transductive Rademacher complexity. Our technique is based on a novel general error bound for transduction in terms of transductive Rademacher complexity, together with a novel bou ..."
Cited by 22 (2 self)
We develop a technique for deriving data-dependent error bounds for transductive learning algorithms based on transductive Rademacher complexity. Our technique is based on a novel general error bound for transduction in terms of transductive Rademacher complexity, together with a novel bounding technique for Rademacher averages for particular algorithms, in terms of their "unlabeled-labeled" representation. This technique is relevant to many advanced graph-based transductive algorithms, and we demonstrate its effectiveness by deriving error bounds for three well-known algorithms. Finally, we present a new PAC-Bayesian bound for mixtures of transductive algorithms based on our Rademacher bounds.
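For orientation, the sketch below gives a commonly cited form of the transductive Rademacher complexity over a full sample of m labeled and u unlabeled points; the normalization and the choice of the parameter p follow the usual presentation and may differ in detail from the paper's definition.

% Hedged sketch: a standard transductive Rademacher complexity definition.
% V is a set of prediction vectors over all m + u points; p is a sparsity parameter.
\[
  \mathcal{R}_{m+u}(\mathcal{V}) \;=\; \Big(\tfrac{1}{m} + \tfrac{1}{u}\Big)\,
  \mathbb{E}_{\boldsymbol{\sigma}}\Big[\sup_{\mathbf{v}\in\mathcal{V}} \boldsymbol{\sigma}^{\top}\mathbf{v}\Big],
  \qquad
  \Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = p,\quad \Pr(\sigma_i = 0) = 1 - 2p,
\]
with the common parameter choice $p = mu/(m+u)^2$.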
Combining PAC-Bayesian and Generic Chaining Bounds
2007
"... There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements ..."
Cited by 9 (1 self)
There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular, we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher-based bounds.
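As background for the combination discussed above, a standard McAllester-style PAC-Bayes bound for losses in [0, 1] is sketched below (empirical risk \widehat{L}_n, prior \pi, posterior \rho, sample size n); the combined PAC-Bayes/chaining bound of the paper is structurally different and tighter.

% Hedged sketch: one standard PAC-Bayes inequality, not the paper's combined bound.
% With probability at least 1 - delta over an i.i.d. sample of size n,
% simultaneously for all posteriors rho over the hypothesis class:
\[
  \mathbb{E}_{h\sim\rho}\big[L(h)\big] \;\le\;
  \mathbb{E}_{h\sim\rho}\big[\widehat{L}_n(h)\big]
  \;+\; \sqrt{\frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.
\]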
Mixability in statistical learning
- In Advances in Neural Information Processing Systems
"... Statistical learning and sequential prediction are two different but related for-malisms to study the quality of predictions. Mapping out their relations and trans-ferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequent ..."
Cited by 4 (3 self)
Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability.
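For concreteness, the stochastic mixability condition has roughly the following shape; the paper's formal definition adds conditions on the loss \ell, the model \mathcal{F}, and the distribution P, and the details below are a hedged sketch rather than the exact statement. Here f^* denotes the risk minimizer in \mathcal{F} and \eta > 0 is the mixability parameter.

% Hedged sketch of an eta-stochastic-mixability condition.
\[
  \mathbb{E}_{Z\sim P}\Big[\exp\big(-\eta\,\big(\ell(f,Z) - \ell(f^{*},Z)\big)\big)\Big] \;\le\; 1
  \qquad \text{for all } f \in \mathcal{F}.
\]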
Model selection type aggregation with better variance control or Fast learning rates in statistical inference through aggregation
2008
"... ..."
Choice of V for V-Fold Cross-Validation in Least-Squares Density Estimation
2016
"... Abstract This paper studies V -fold cross-validation for model selection in least-squares density estimation. The goal is to provide theoretical grounds for choosing V in order to minimize the least-squares loss of the selected estimator. We first prove a non-asymptotic oracle inequality for V -fol ..."
This paper studies V-fold cross-validation for model selection in least-squares density estimation. The goal is to provide theoretical grounds for choosing V in order to minimize the least-squares loss of the selected estimator. We first prove a non-asymptotic oracle inequality for V-fold cross-validation and its bias-corrected version (V-fold penalization). In particular, this result implies that V-fold penalization is asymptotically optimal in the nonparametric case. Then, we compute the variance of V-fold cross-validation and related criteria, as well as the variance of key quantities for model selection performance. We show that these variances depend on V like 1 + 4/(V − 1), at least in some particular cases, suggesting that the performance improves considerably from V = 2 to V = 5 or 10, and is then almost constant. Overall, this can explain the common advice to take V = 5, at least in our setting and when computational power is limited, as supported by some simulation experiments. An oracle inequality and exact formulas for the variance are also proved for Monte-Carlo cross-validation, also known as repeated cross-validation, where the parameter V is replaced by the number B of random splits of the data.
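A minimal Python sketch of the variance factor quoted above, assuming only the displayed form 1 + 4/(V − 1); the absolute variance also involves setting-dependent constants not shown here.

# Hedged illustration of the factor 1 + 4/(V - 1) quoted in the abstract.
def variance_factor(V: int) -> float:
    """Relative variance factor of V-fold criteria as a function of V."""
    return 1.0 + 4.0 / (V - 1)

for V in (2, 5, 10, 20, 100):
    print(f"V = {V:3d}: factor = {variance_factor(V):.3f}")
# Output: 5.000, 2.000, 1.444, 1.211, 1.040 -- most of the gain is already
# realized by V = 5 or 10, consistent with the advice in the abstract.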
REQUIREMENTS
"... This thesis would be impossible without a wise guidance of my advisor, Prof. Ran El-Yaniv. The path to the results presented in this thesis was long and I thank Ran for never loosing the faith in the final success. In both peaceful and stressful times, Ran constantly supported and navigated me towar ..."
This thesis would have been impossible without the wise guidance of my advisor, Prof. Ran El-Yaniv. The path to the results presented in this thesis was long, and I thank Ran for never losing faith in the final success. In both peaceful and stressful times, Ran constantly supported and navigated me towards stronger results. Many thanks go to my coauthors during the thesis period: Ron
Randomized estimators, Statistical learning
2009
"... ABSTRACT: We consider the problem of predicting as well as the best linear combination of d given functions in least squares regression, and variants of this problem including constraints on the parameters of the linear combination. When the input distribution is known, there already exists an algor ..."
We consider the problem of predicting as well as the best linear combination of d given functions in least squares regression, and variants of this problem including constraints on the parameters of the linear combination. When the input distribution is known, there already exists an algorithm having an expected excess risk of order d/n, where n is the size of the training data. Without this strong assumption, standard results often contain a multiplicative log n factor, and require some additional assumptions like uniform boundedness of the d-dimensional input representation and exponential moments of the output. This work provides new risk bounds for the ridge estimator and the ordinary least squares estimator, and their variants. It also provides shrinkage procedures with convergence rate d/n (i.e., without the logarithmic factor) in expectation and in deviations, under various assumptions. The key common surprising factor of these results is the absence of an exponential moment condition on the output distribution while achieving exponential deviations. All risk bounds are obtained through a PAC-Bayesian analysis on truncated differences of losses. Finally, we show that some of these results are not particular to the
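As a point of reference for the estimators discussed above, here is a minimal Python sketch of a generic ridge estimator; it is not the paper's truncated or shrinkage procedure, and the regularization level lam and the synthetic data are placeholders.

# Hedged sketch: a plain ridge estimator, not the paper's shrinkage/truncation variants.
import numpy as np

def ridge(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Return beta minimizing (1/n) * ||y - X beta||^2 + lam * ||beta||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Toy usage with synthetic data (the d given functions are the d columns of X).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)
print(ridge(X, y, lam=1e-2))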