## Generalization error bounds using unlabeled data (2005)

### Download Links

- [www-connex.lip6.fr]
- [www.cs.helsinki.fi]
- [www-poleia.lip6.fr]
- [eprints.pascal-network.org]
- DBLP

### Other Repositories/Bibliography

Venue: Learning Theory: 18th Annual Conference on Learning Theory, COLT 2005

Citations: 21 (2 self)

### BibTeX

@INPROCEEDINGS{Kääriäinen05generalizationerror,
  author = {Matti Kääriäinen},
  title = {Generalization error bounds using unlabeled data},
  booktitle = {Learning Theory: 18th Annual Conference on Learning Theory, COLT 2005},
  year = {2005},
  pages = {127--142},
  publisher = {COLT}
}

### Abstract

We present two new methods for obtaining generalization error bounds in a semi-supervised setting. Both methods are based on approximating the disagreement probability of pairs of classifiers using unlabeled data. The first method works in the realizable case. It suggests how the ERM principle can be refined using unlabeled data and has provable optimality guarantees when the number of unlabeled examples is large. Furthermore, the technique extends easily to cover active learning. A downside is that the method is of little use in practice due to its limitation to the realizable case. The idea in our second method is to use unlabeled data to transform bounds for randomized classifiers into bounds for simpler deterministic classifiers. As a concrete example of how the general method works in practice, we apply it to a bound based on cross-validation. The result is a semi-supervised bound for classifiers learned based on all the labeled data. The bound is easy to implement and apply and should be tight whenever cross-validation makes sense. Applying the bound to SVMs on the MNIST benchmark data set gives results that suggest that the bound may be tight enough to be useful in practice.
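Both methods hinge on the fact that the disagreement probability d(f, g) = P(f(X) ≠ g(X)) can be estimated from unlabeled data alone. A minimal sketch of that estimation, with hypothetical threshold classifiers standing in for learned hypotheses (none of this code is from the paper):

```python
import numpy as np

# Two hypothetical classifiers on 1-D inputs; they disagree exactly on (0, 0.3].
def f(x):
    return (x > 0.0).astype(int)

def g(x):
    return (x > 0.3).astype(int)

def estimate_disagreement(f, g, x_unlabeled):
    """Empirical estimate of d(f, g) = P(f(X) != g(X)); no labels are used."""
    return float(np.mean(f(x_unlabeled) != g(x_unlabeled)))

rng = np.random.default_rng(0)
x_pool = rng.uniform(-1.0, 1.0, size=100_000)   # stand-in for the unlabeled sample
d_hat = estimate_disagreement(f, g, x_pool)
# The disagreement region (0, 0.3] has mass 0.15 under Uniform(-1, 1).
```

With 100 000 unlabeled points the estimate concentrates tightly around the true disagreement mass, which is what makes unlabeled data so cheap to exploit in both methods.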

### Citations

10365 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context ...ns a target function f0 for which Y = f0(X) (always or at least with probability 1). More specifically, we show that we can improve on the best results obtainable for empirical risk minimization (ERM) [5] provided we have access to a sufficiently large sample of unlabeled examples. In our second method for obtaining semi-supervised generalization error bounds we drop the assumption of the existence of...

1789 | A Theory of the Learnable
- Valiant
- 1984
Citation Context ...earner in advance is hardly ever justifiable in practice. This limitation is not a problem of our setting only, but affects, e.g., all results obtained in the original PAC model introduced by Valiant [11]. In this section, we drop all assumptions about the existence of a target, which makes our results applicable in all situations covered by the semi-supervised learning model. Our bounds for the gener...

1599 | Making large-scale SVM learning practical
- Joachims
- 1998
Citation Context ...and forgot the labels of the remaining 10 000 examples to get a set of unlabeled data. The only preprocessing was scaling the pixel intensities to [−1, 1]. As the learning algorithm, we used svmlight [18], a standard implementation of the C-SVM learning algorithm. The algorithm is capable of solving binary problems only, so we transformed the original learning problem into ten 1 vs rest problems. That...
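The 1 vs rest transformation mentioned in this excerpt can be sketched as follows (a hypothetical helper, not the paper's code; the paper applied svmlight to the actual MNIST labels):

```python
import numpy as np

def one_vs_rest_labels(y, num_classes=10):
    """Relabel a multi-class problem as the ten binary 1-vs-rest problems
    described in the excerpt: class c becomes +1, everything else -1."""
    return [np.where(y == c, 1, -1) for c in range(num_classes)]

y = np.array([3, 0, 3, 7])              # toy stand-in for MNIST digit labels
binary_problems = one_vs_rest_labels(y)
# binary_problems[3] marks the class-3 examples positive: [1, -1, 1, -1]
```

Each of the ten label vectors then defines one binary C-SVM training problem.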

1094 | A Probabilistic Theory of Pattern Recognition
- Devroye, Györfi, et al.
- 1996
Citation Context ...y f as follows: Choose an X according to PX, and let Y′ = f(X). It is easy to see that fvote is the Bayes classifier for this problem and thus has the minimal probability of misclassifying (X, Y′) [15]. By the definition of Y′, this probability is d(f, fvote). Now d(f, fvote) ≤ inf_g d(f, g) ≤ d(f, Y) = ε(f), since Y can be viewed as a potential choice of g. Combining this theorem with Corollary 2 g...
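The inequality quoted above, d(f, fvote) ≤ inf_g d(f, g), can be checked empirically: at every point, the majority vote minimizes the fraction of classifiers that disagree with it, so averaging over points preserves the bound. The predictions below are synthetic toy data, not from the paper:

```python
import numpy as np

# Hypothetical predictions of |F0| = 5 classifiers on 200 points, labels in {0, 1}.
rng = np.random.default_rng(2)
preds = rng.integers(0, 2, size=(5, 200))

# Pointwise majority vote f_vote over F0 (5 voters, so no ties).
f_vote = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

# d(f, f_vote) for the randomized classifier f drawing uniformly from F0,
# versus d(f, g) for each individual member g of F0.
d_f_vote = float(np.mean(preds != f_vote))
d_f_members = [float(np.mean(preds != preds[k])) for k in range(preds.shape[0])]
# The majority vote is never worse than any single member.
```

The same pointwise argument is what makes fvote the Bayes classifier for the induced problem in the excerpt.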

280 | Rademacher and Gaussian complexities: Risk bounds and structural results
- Bartlett, Mendelson
Citation Context ...ˆd) are with high probability close to their true expectations (in our notation d). This yields an upper bound for sup {(d − ˆd)(g′, g) | g′, g ∈ F0}, the quantity we are interested in. Following [9], we define the Rademacher penalty Rm(H) of a class H of functions from X to {0, 1} as follows: Rm(H) = sup_{h∈H} |(1/m) Σ_{j=1}^{m} σ_j (1 − 2h(X_{n+j}))|. Here, the random elements Xn+j are in...
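One draw of the Rademacher penalty as defined in this excerpt can be sketched for a small finite class, represented simply by its {0, 1} predictions on m unlabeled points (the class itself is hypothetical):

```python
import numpy as np

def rademacher_penalty(h_values, rng):
    """One realization of R_m(H) = sup_{h in H} |(1/m) * sum_j sigma_j * (1 - 2h(X_{n+j}))|
    for a finite class H. h_values has one row per hypothesis and one column
    per unlabeled point, entries h(X_{n+j}) in {0, 1}."""
    m = h_values.shape[1]
    sigma = rng.choice([-1, 1], size=m)              # i.i.d. Rademacher signs
    averages = (sigma * (1 - 2 * h_values)).mean(axis=1)
    return float(np.abs(averages).max())             # sup over the finite class

rng = np.random.default_rng(1)
h_values = rng.integers(0, 2, size=(3, 1000))        # toy class of 3 hypotheses, m = 1000
penalty = rademacher_penalty(h_values, rng)
```

Since each summand lies in {−1, +1}, the penalty is always in [0, 1], and for a small class it shrinks at the familiar O(1/√m) rate as the unlabeled sample grows.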

83 | On the exponential value of labeled samples
- Castelli, Cover
- 1995
Citation Context ...upervised learning have received less attention, although some interesting results have been published recently [1, 2]. The value of unlabeled data to learning has been studied in restricted settings [3, 4], but to our knowledge the general question of whether unlabeled data provably helps in classifier learning has not been answered. We prove that unlabeled data is useful in the realizable case, that i...

65 | PAC-Bayesian Stochastic Model Selection
- McAllester
- 2003
Citation Context ...The randomized classifier together with its bound thus plays the role the target function had in the realizable case. Randomized bounds that can be used here include, e.g., the PAC-Bayesian bounds [12], the recent bounds for ensembles of classifiers created by an online learning algorithm [13], and the progressive validation bound [14]. Test set bounds can be interpreted as a special case of this s...

56 | Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. Doctoral dissertation
- Seeger
- 2003
Citation Context ...ative bounds, but feel that it is safe to claim that they all would have been way above 100% on the multi-class learning problem at hand (for a survey on some of these bounds and their looseness, see [19]). Also note that the bound for the best fi underlying f is worse than the bound for ffinal. The intuitive explanation is that ffinal has an advantage because it is learned based on all the labeled da...

53 | A PAC-style model for learning from labeled and unlabeled data
- Balcan, Blum
- 2004
Citation Context ...ised learning algorithms that can be used in practice. The theoretical aspects of semi-supervised learning have received less attention, although some interesting results have been published recently [1, 2]. The value of unlabeled data to learning has been studied in restricted settings [3, 4], but to our knowledge the general question of whether unlabeled data provably helps in classifier learning has ...

44 | Almost everywhere algorithmic stability and generalization error
- Kutin, Niyogi
- 2002
Citation Context ...errors of a learned classifier are close to each other. If the algorithm is ERM, then training stability is also necessary and sufficient for successful generalization. For this line of research, see [17] and the references therein. The notions of algorithmic stability measure how much the error of a learned hypothesis (on a point or over the whole of X) may change when the labeled learning sample is...

42 | Beating the hold-out: bounds for k-fold and progressive cross-validation
- Blum, Kalai, et al.
- 1999
Citation Context ...ds that can be used here include, e.g., the PAC-Bayesian bounds [12], the recent bounds for ensembles of classifiers created by an online learning algorithm [13], and the progressive validation bound [14]. Test set bounds can be interpreted as a special case of this setting in which the randomized classifier is actually deterministic. Also bagging and cross-validation can be used as bases for generali...

37 | Learning from a mixture of labeled and unlabeled examples with parametric side information
- Ratsaby, Venkatesh
- 1995
Citation Context ...upervised learning have received less attention, although some interesting results have been published recently [1, 2]. The value of unlabeled data to learning has been studied in restricted settings [3, 4], but to our knowledge the general question of whether unlabeled data provably helps in classifier learning has not been answered. We prove that unlabeled data is useful in the realizable case, that i...

23 | Metric-based methods for adaptive model selection and regularization
- Schuurmans, Southey
- 2002
Citation Context ...using the disagreement probability d(f, g) = P(f ≠ g) = P(f(X) ≠ g(X)) as a metric in the space of randomized classifiers. Variants of d have been used earlier as a basis for model selection criteria [6, 7], in providing lower bounds and estimates of the variance of the error of a hypothesis produced by a learning algorithm in a co-validation setting [1], and as an example of a distance measure that can...

23 | Improved risk tail bounds for on-line algorithms
- Cesa-Bianchi, Gentile
Citation Context ...in the realizable case. Randomized bounds that can be used here include, e.g., the PAC-Bayesian bounds [12], the recent bounds for ensembles of classifiers created by an online learning algorithm [13], and the progressive validation bound [14]. Test set bounds can be interpreted as a special case of this setting in which the randomized classifier is actually deterministic. Also bagging and cross-v...

15 | Extensions to Metric-Based Model Selection
- Bengio, Chapados
- 2003
Citation Context ...using the disagreement probability d(f, g) = P(f ≠ g) = P(f(X) ≠ g(X)) as a metric in the space of randomized classifiers. Variants of d have been used earlier as a basis for model selection criteria [6, 7], in providing lower bounds and estimates of the variance of the error of a hypothesis produced by a learning algorithm in a co-validation setting [1], and as an example of a distance measure that can...

13 | Learning by distances
- Ben-David, Itai, et al.
- 1990
Citation Context ... the variance of the error of a hypothesis produced by a learning algorithm in a co-validation setting [1], and as an example of a distance measure that can be used in the learning by distances model [8]. To our knowledge, using d in proving generalization error bounds is original to our work. The disagreement probability d is very natural in this ...

4 | Co-Validation: Using Model Disagreement to Validate Classification Algorithms. Neural Information Processing Systems
- Madani, Pennock, et al.
- 2004
Citation Context ...ised learning algorithms that can be used in practice. The theoretical aspects of semi-supervised learning have received less attention, although some interesting results have been published recently [1, 2]. The value of unlabeled data to learning has been studied in restricted settings [3, 4], but to our knowledge the general question of whether unlabeled data provably helps in classifier learning has ...

1 | Relating the Rademacher and VC bounds
- Kääriäinen
- 2004
Citation Context ...he bound behaves in the worst case as a function of m, one can resort to further upper bounds based on (upper bounds for) the VC dimension of {⟦g′ ≠ g⟧ | g′, g ∈ F0} to get the following corollary [9, 10]. Corollary 1. Let f̂ be the empirical center of F0 and let D be an upper bound for the (data-dependent) VC dimension of {⟦g′ ≠ g⟧ | g′, g ∈ F0}. Then with probability at least 1 − δ (over the choi...

1 | Practical prediction theory for classification (2003). A tutorial presented at ICML 2003. Available at http://hunch.net/~jl/projects/prediction_bounds/tutorial/tutorial.pdf
- Langford
Citation Context ... That is, let the set of classifiers underlying f be {f1, ..., fk} and let Θf have uniform distribution in {1, ..., k}. The following generalization error bound for f that builds on tight test set bounds [16] for the underlying classifiers fi is to our knowledge new. Theorem 7. Let f be the randomized classifier obtained by cross-validation as explained above. Then with probability at least 1 − δ (over the ...
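The randomized classifier f of Theorem 7 can be sketched as follows: Θf picks one of the k cross-validation classifiers uniformly at random on each call. The fold classifiers below are hypothetical threshold rules, not the paper's SVMs:

```python
import numpy as np

def make_randomized_classifier(fold_classifiers, rng):
    """Randomized classifier in the style of Theorem 7: each prediction uses a
    classifier index drawn uniformly from {0, ..., k-1} (Theta_f uniform)."""
    k = len(fold_classifiers)
    def f(x):
        idx = rng.integers(0, k)        # fresh uniform draw per prediction
        return fold_classifiers[idx](x)
    return f

# Hypothetical k = 3 fold classifiers: simple threshold rules on 1-D inputs.
folds = [lambda x, t=t: int(x > t) for t in (0.0, 0.1, 0.2)]
rng = np.random.default_rng(3)
f = make_randomized_classifier(folds, rng)
pred = f(0.5)   # all folds agree here, so the random draw does not matter
```

The semi-supervised bound then transfers from this randomized f to the deterministic classifier trained on all the labeled data via their estimated disagreement.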