## On the Generalization Ability of On-line Learning Algorithms (2001)

### Download Links

- [www-2.cs.cmu.edu]
- [www.dicom.uninsubria.it]
- [books.nips.cc]
- DBLP

### Other Repositories/Bibliography

Venue: IEEE Transactions on Information Theory

Citations: 137 (8 self)

### BibTeX

```bibtex
@ARTICLE{Cesa-Bianchi01onthe,
  author  = {Nicolo Cesa-Bianchi and Alex Conconi and Claudio Gentile},
  title   = {On the Generalization Ability of On-line Learning Algorithms},
  journal = {IEEE Transactions on Information Theory},
  year    = {2004},
  volume  = {50},
  pages   = {2050--2057}
}
```

### Abstract

In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments, and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable to or better than the best known bounds.

### Citations

9457 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...bounds that in many cases are comparable or better than the best known bounds. 1 Introduction One of the main contributions of the recent statistical theories for regression and classification problems [21, 19] is the derivation of functionals of certain empirical quantities (such as the sample error or the sample margin) that provide uniform risk bounds for all the hypotheses in a certain class. This appro...

2131 | Learning with Kernels
- Schölkopf, Smola
- 2002
Citation Context: ...the classical Perceptron algorithm [35], [36], [37] in its dual kernel form, as investigated in, e.g., [27], [38]. For an introduction to kernels in learning theory the reader is referred to [39], or to [40] for a more in-depth monograph. Here, we just recall the following basic definitions. ...

1021 | A probabilistic theory of pattern recognition
- Devroye, Györfi, et al.
- 1996
Citation Context: ...We analyze learning algorithms within the framework of statistical pattern recognition (see, e.g., [1]). In this framework all the examples are generated by independent draws from a fixed and unknown probability distribution on X × Y. This assumption allows us to view the training set as a statistical...

979 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context: ...considering the classical Perceptron algorithm [35], [36], [37] in its dual kernel form, as investigated in, e.g., [27], [38]. For an introduction to kernels in learning theory the reader is referred to [39], or to [40] for a more in-depth monograph. Here, we just recall the following basic definitions. ...

959 | On the uniform convergence of relative frequencies of events to their probabilities
- Vapnik, Chervonenkis
- 1971
Citation Context: ...probability is taken with respect to the distribution of the training sample. To achieve this goal, we can use the method of uniform convergence, whose study was pioneered by Vapnik and Chervonenkis [2] (see also [3], [4]). Uniform convergence means that, for all probability distributions, the empirical risk of...

849 | The perceptron: a probabilistic model for information storage and organization in the brain
- Rosenblatt
- 1958
Citation Context: ...linear-threshold learners generate hypotheses of the form h(x) = SGN(w · x), where w is a so-called weight vector associated with hypothesis h. We begin by considering the classical Perceptron algorithm [35], [36], [37] in its dual kernel form, as investigated in, e.g., [27], [38]. For an introduction to kernels in learning theory the reader is referred to [39], or to [40] for a more in-depth monograph. ...

748 | Boosting the margin: a new explanation for the effectiveness of voting methods
- Schapire, Freund, et al.
- 1997
Citation Context: ...Prominent examples of this kind are the bounds for linear-threshold classifiers, where the statistic depends on the margin of h [11], [12], [13], [14], and the bounds for Bayesian mixtures, where it depends on the Kullback-Leibler divergence between the data-dependent mixture coefficients and the a priori coefficients [15]. Note that bounds of the for...

693 | The weighted majority algorithm
- Littlestone, Warmuth
- 1994
Citation Context: ...(if new data is added to the training set then the algorithm needs to be run again from scratch). On-line learning algorithms, such as the Perceptron algorithm [17], the Winnow algorithm [14], and their many variants [16, 6, 13, 10, 2, 9], are general methods for solving classification and regression problems that can be used in a fully incremental fashion. That is, they need (in most cases) a short time to process each new training e...

681 | Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm
- Littlestone
- 1988
Citation Context: ...seldom incremental (if new data is added to the training set then the algorithm needs to be run again from scratch). On-line learning algorithms, such as the Perceptron algorithm [17], the Winnow algorithm [14], and their many variants [16, 6, 13, 10, 2, 9], are general methods for solving classification and regression problems that can be used in a fully incremental fashion. That is, they need (in most cas...

661 | Queries and concept learning
- Angluin
- 1988
Citation Context: ...The on-line algorithms we investigate are defined within a well-known mathematical model, which is a generalization of a learning model introduced by Littlestone [14] and Angluin [1]. Let a training sequence z_t = ((x_1, y_1), ..., (x_t, y_t)) in (X × Y)^t be fixed. In this learning model, an on-line algorithm processes the examples in z_t one at a time in trials, generatin...
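The trial-by-trial protocol described in this context (predict on x_i with the current hypothesis, then observe y_i and possibly update) can be sketched with the classical Perceptron as the underlying learner; the function name and example data below are illustrative, not taken from the paper:

```python
import numpy as np

def online_perceptron(examples):
    """Sketch of the on-line protocol: in each trial the learner sees x_i,
    predicts with its current hypothesis, then observes the true label y_i
    and may update. Here the hypothesis is a Perceptron weight vector."""
    dim = len(examples[0][0])
    w = np.zeros(dim)
    hypotheses = [w.copy()]      # h_0, h_1, ..., h_t generated during the run
    mistakes = 0
    for x, y in examples:        # labels y are in {-1, +1}
        prediction = 1.0 if w @ x >= 0 else -1.0
        if prediction != y:      # classical Perceptron: update only on a mistake
            w = w + y * np.asarray(x, dtype=float)
            mistakes += 1
        hypotheses.append(w.copy())
    return hypotheses, mistakes
```

The list of intermediate hypotheses is kept deliberately, since the paper's analysis concerns exactly this sequence of hypotheses generated during the run.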

526 | Ridge Regression: Biased Estimation for Nonorthogonal Problems
- Hoerl, Kennard
- 1970
Citation Context: ...better than the bound in [20] whenever D_γ(u; Z_t)/t is constant, which typically occurs when the data sequence is not linearly separable. As a second application, we consider the ridge regression algorithm [12] for square loss. Assume X = R^n and Y = [-Y, +Y]. This algorithm computes at the beginning of the i-th trial the vector w = w_{i-1} which minimizes (a/2)||w||_2^2 + sum_{j=1}^{i-1} (1/2)(y_j - w · x_j)^2, whe...
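The per-trial minimization quoted here has the closed form w = (aI + Σ_{j<i} x_j x_jᵀ)⁻¹ Σ_{j<i} y_j x_j, which a short sketch can exploit; the function name is illustrative, and the clipping of predictions to [-Y, +Y] follows the description in the neighboring contexts:

```python
import numpy as np

def ridge_regression_online(X, y, a=1.0, Y=1.0):
    """At the start of trial i, compute w_{i-1} minimizing
    (a/2)||w||^2 + sum_{j<i} (1/2)(y_j - w . x_j)^2, whose closed form is
    w = (a*I + sum_j x_j x_j^T)^{-1} sum_j y_j x_j, then predict on x_i."""
    n = X.shape[1]
    A = a * np.eye(n)            # a*I plus accumulated outer products x_j x_j^T
    b = np.zeros(n)              # accumulated y_j * x_j
    predictions = []
    for x_i, y_i in zip(X, y):
        w = np.linalg.solve(A, b)                     # minimizer over past examples
        predictions.append(np.clip(w @ x_i, -Y, Y))   # clip prediction to [-Y, +Y]
        A += np.outer(x_i, x_i)                       # incorporate the new example
        b += y_i * x_i
    return np.array(predictions)
```

Because the minimizer only involves the running sums A and b, each trial costs one linear solve rather than a retraining from scratch, which is the incremental behavior the surrounding discussion emphasizes.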

420 | Large margin classification using the perceptron algorithm
- Freund, Schapire
- 1998
Citation Context: ...of the on-line algorithm and provides risk tail bounds that are sharper than those obtainable choosing, for instance, the hypothesis in the run that survived the longest. Helmbold, Warmuth, and others [11, 6, 8] showed that, without using any cross-validation sets, one can obtain expected risk bounds (as opposed to the more informative tail bounds) for a hypothesis randomly drawn among those generated during...

320 | How to use expert advice
- Cesa-Bianchi, Freund, et al.
- 1997
Citation Context: ...(if new data is added to the training set then the algorithm needs to be run again from scratch). On-line learning algorithms, such as the Perceptron algorithm [17], the Winnow algorithm [14], and their many variants [16, 6, 13, 10, 2, 9], are general methods for solving classification and regression problems that can be used in a fully incremental fashion. That is, they need (in most cases) a short time to process each new training e...

292 | Theoretical foundations of the potential function method in pattern recognition learning
- Aizerman, Braverman, Rozoner
- 1964
Citation Context: ...where w is a so-called weight vector associated with hypothesis h. We begin by considering the classical Perceptron algorithm [35], [36], [37] in its dual kernel form, as investigated in, e.g., [27], [38]. For an introduction to kernels in learning theory the reader is referred to [39], or to [40] for a more in-depth monograph. ...

272 | Rademacher and Gaussian Complexities: Risk Bounds and Structural Results
- Bartlett, Mendelson
Citation Context: ...replaces the square-root term in (1) with a random quantity based on a sample statistic. For example, the statistic can be the empirical VC-entropy [7], [8], the Rademacher complexity [9], or the maximum discrepancy [10] of the class. In general, this approach is advantageous when the mean of the statistic is significantly smaller and when large deviations of it are unlikely. In these...

256 | Structural risk minimization over data-dependent hierarchies
- Shawe-Taylor
- 1998
Citation Context: ...bounds that in many cases are comparable or better than the best known bounds. 1 Introduction One of the main contributions of the recent statistical theories for regression and classification problems [21, 19] is the derivation of functionals of certain empirical quantities (such as the sample error or the sample margin) that provide uniform risk bounds for all the hypotheses in a certain class. This appro...

252 | Principles of neurodynamics: perceptrons and the theory of brain mechanisms
- Rosenblatt
- 1962
Citation Context: ...minimizing algorithm is seldom incremental (if new data is added to the training set then the algorithm needs to be run again from scratch). On-line learning algorithms, such as the Perceptron algorithm [17], the Winnow algorithm [14], and their many variants [16, 6, 13, 10, 2, 9], are general methods for solving classification and regression problems that can be used in a fully incremental fashion. That...

237 | Weighted sums of certain dependent random variables
- Azuma
- 1967
Citation Context: ...A direct application of the Hoeffding-Azuma inequality (a generalization of Chernoff-Hoeffding bounds to sums of conditionally zero-mean bounded random variables [32]) proves the lemma. We will be using this simple concentration result several times in the rest of this section. ...

179 | The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network
- Bartlett
- 1998
Citation Context: ...where we applied (1) in the first and the last step. A different approach to uniform convergence, pioneered in [6], replaces the square-root term in (1) with a random quantity based on a sample statistic. For example, the statistic can be the empirical VC-entropy [7], [8]...

176 | Stability and generalization
- Bousquet, Elisseeff
- 2002
Citation Context: ...which only hold for the hypotheses generated by learning algorithms satisfying certain properties. Examples along these lines are the notions of self-bounding learners [16], [17], algorithmic stability [18], and algorithmic luckiness [19]. In this paper we follow a similar idea and develop a general framework for analyzing the risk of hypotheses generated by on-line learners, a specific class of learnin...

160 | Universal prediction of individual sequences
- Feder, Gutman
- 1991
Citation Context: ...thus avoiding the sophisticated statistical tools required by risk analyses based on uniform convergence. We borrow results from the literature on prediction of individual sequences (see, e.g., [20], [21], [22], [23], [24] for early references on the subject, and [25], [26], [27], [28], [29], [30], [31] for specific work on the pattern classification problem). Based on strong pointwise bounds on the s...

136 | Additive versus Exponentiated Gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context: ...(if new data is added to the training set then the algorithm needs to be run again from scratch). On-line learning algorithms, such as the Perceptron algorithm [17], the Winnow algorithm [14], and their many variants [16, 6, 13, 10, 2, 9], are general methods for solving classification and regression problems that can be used in a fully incremental fashion. That is, they need (in most cases) a short time to process each new training e...

134 | On convergence proofs on perceptrons
- Novikoff
- 1962
Citation Context: ...linear-threshold learners generate hypotheses of the form h(x) = SGN(w · x), where w is a so-called weight vector associated with hypothesis h. We begin by considering the classical Perceptron algorithm [35], [36], [37] in its dual kernel form, as investigated in, e.g., [27], [38]. For an introduction to kernels in learning theory the reader is referred to [39], or to [40] for a more in-depth monograph. ...

119 | Empirical margin distributions and bounding the generalization error of combined classifiers
- Koltchinskii, Panchenko
Citation Context: ...Prominent examples of this kind are the bounds for linear-threshold classifiers, where the statistic depends on the margin of h [11], [12], [13], [14], and the bounds for Bayesian mixtures, where it depends on the Kullback-Leibler divergence between the data-dependent mixture coefficients and the a priori coefficients [15]. Note that bound...

117 | Relative loss bounds for on-line density estimation with the exponential family of distributions
- Azoury, Warmuth

107 | A game of prediction with expert advice
- Vovk
- 1998
Citation Context: ...sophisticated statistical tools required by risk analyses based on uniform convergence. We borrow results from the literature on prediction of individual sequences (see, e.g., [20], [21], [22], [23], [24] for early references on the subject, and [25], [26], [27], [28], [29], [30], [31] for specific work on the pattern classification problem). Based on strong pointwise bounds on the sample statistic go...

107 | Redundant noisy attributes, attribute errors, and linear-threshold learning using winnow
- Littlestone
- 1991
Citation Context: ...convergence. We borrow results from the literature on prediction of individual sequences (see, e.g., [20], [21], [22], [23], [24] for early references on the subject, and [25], [26], [27], [28], [29], [30], [31] for specific work on the pattern classification problem). Based on strong pointwise bounds on the sample statistic governing the risk for a specific on-line learner, the kernel Perceptron algorithm, ...

87 | From online to batch learning
- Littlestone
- 1989
Citation Context: ...where no assumptions are made on the way the training sequence is generated, there are fewer results concerning how to use these algorithms to obtain hypotheses with small statistical risk. Littlestone [15] proposed a method for obtaining small risk hypotheses from a run of an arbitrary on-line algorithm by using a cross-validation set to test each one of the hypotheses generated during the run. This me...

83 | General convergence results for linear discriminant updates
- Grove, Littlestone, et al.

80 | A sharp concentration inequality with applications
- Boucheron, Lugosi, et al.
- 2000
Citation Context: ...pioneered in [6], replaces the square-root term in (1) with a random quantity based on a sample statistic. For example, the statistic can be the empirical VC-entropy [7], [8], the Rademacher complexity [9], or the maximum discrepancy [10] of the class. In general, this approach is advantageous when the mean of the statistic is significantly smaller and when large ...

79 | The perceptron: A model for brain functioning
- Block
- 1962
Citation Context: ...linear-threshold learners generate hypotheses of the form h(x) = SGN(w · x), where w is a so-called weight vector associated with hypothesis h. We begin by considering the classical Perceptron algorithm [35], [36], [37] in its dual kernel form, as investigated in, e.g., [27], [38]. For an introduction to kernels in learning theory the reader is referred to [39], or to [40] for a more in-depth monograph. ...

77 | Model selection and error estimation
- Bartlett, Boucheron, et al.
- 1999
Citation Context: ...a sample statistic. For example, the statistic can be the empirical VC-entropy [7], [8], the Rademacher complexity [9], or the maximum discrepancy [10] of the class. In general, this approach is advantageous when the mean of the statistic is significantly smaller and when large deviations of it are unlikely. In these cases such “data-dependent” unif...

75 | Sequential prediction of individual sequences under general loss functions
- Haussler, Kivinen, et al.
- 1998
Citation Context: ...avoiding the sophisticated statistical tools required by risk analyses based on uniform convergence. We borrow results from the literature on prediction of individual sequences (see, e.g., [20], [21], [22], [23], [24] for early references on the subject, and [25], [26], [27], [28], [29], [30], [31] for specific work on the pattern classification problem). Based on strong pointwise bounds on the sample ...

73 | Tracking the best disjunction
- Auer, Warmuth
Citation Context: ...risk analyses based on uniform convergence. We borrow results from the literature on prediction of individual sequences (see, e.g., [20], [21], [22], [23], [24] for early references on the subject, and [25], [26], [27], [28], [29], [30], [31] for specific work on the pattern classification problem). Based on strong pointwise bounds on the sample statistic governing the risk for a specific on-line learne...

64 | The robustness of the p-norm algorithms
- Gentile

60 | A second-order perceptron algorithm
- Cesa-Bianchi, Conconi, et al.
- 2002
Citation Context: ...analyses based on uniform convergence. We borrow results from the literature on prediction of individual sequences (see, e.g., [20], [21], [22], [23], [24] for early references on the subject, and [25], [26], [27], [28], [29], [30], [31] for specific work on the pattern classification problem). Based on strong pointwise bounds on the sample statistic governing the risk for a specific on-line learner, the...

50 | On weak learning
- Helmbold, Warmuth
- 1995
Citation Context: ...of the on-line algorithm and provides risk tail bounds that are sharper than those obtainable choosing, for instance, the hypothesis in the run that survived the longest. Helmbold, Warmuth, and others [11, 6, 8] showed that, without using any cross-validation sets, one can obtain expected risk bounds (as opposed to the more informative tail bounds) for a hypothesis randomly drawn among those generated during...

39 | Beating the holdout: Bounds for k-fold and progressive cross-validation
- Blum, Kalai, et al.
- 1999
Citation Context: ...is m(z_t)/t, where t is the size of the whole set of examples available to the learning algorithm (i.e., training set plus validation set in Littlestone's paper). Similar observations are made in [4], though the analysis there does actually refer only to randomized hypotheses with 0-1 loss (namely, to absolute loss). Let us define the penalized risk estimate of hypothesis h_i by m_i/t_i + c_δ(t_i)...
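The selection rule quoted here (pick the hypothesis minimizing the penalized risk estimate m_i/t_i + c_δ(t_i)) is easy to sketch. The concrete penalty below, of order sqrt(ln(1/δ)/t_i), is an illustrative stand-in for the paper's c_δ, not its exact form, and the function name is an assumption:

```python
import math

def select_hypothesis(mistake_counts, survival_times, delta=0.05):
    """Pick the hypothesis h_i minimizing the penalized risk estimate
    m_i / t_i + c_delta(t_i), where m_i is the number of mistakes of h_i
    on the t_i examples it was tested on.  The penalty used here is an
    assumed Hoeffding-style confidence term, not the paper's exact c_delta."""
    best_index, best_score = None, float("inf")
    for i, (m_i, t_i) in enumerate(zip(mistake_counts, survival_times)):
        penalty = math.sqrt(math.log(1.0 / delta) / (2.0 * t_i))  # assumed c_delta(t_i)
        score = m_i / t_i + penalty
        if score < best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

The point of the penalty is that a hypothesis tested on very few examples is not trusted even if it made no mistakes, since its empirical estimate has large variance.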

38 | Linear hinge loss and average margin
- Gentile, Warmuth
- 1998
Citation Context: ...on uniform convergence. We borrow results from the literature on prediction of individual sequences (see, e.g., [20], [21], [22], [23], [24] for early references on the subject, and [25], [26], [27], [28], [29], [30], [31] for specific work on the pattern classification problem). Based on strong pointwise bounds on the sample statistic governing the risk for a specific on-line learner, the kernel Perc...

29 | Inductive principles of the search for empirical dependences (methods based on weak convergence of probability measures)
- Vapnik
- 1989
Citation Context: ...taken with respect to the distribution of the training sample. To achieve this goal, we can use the method of uniform convergence, whose study was pioneered by Vapnik and Chervonenkis [2] (see also [3], [4]). Uniform convergence means that, for all probability distributions, the empirical risk of h is, with high ...

28 | Generalization error bounds for Bayesian mixture algorithms
- Meir, Zhang
- 2003
Citation Context: ...margin of h [11], [12], [13], [14], and the bounds for Bayesian mixtures, where the statistic depends on the Kullback-Leibler divergence between the data-dependent mixture coefficients and the a priori coefficients [15]. Note that bounds of the form (2) leave open the algorithmic problem of finding the hypothesis optimizing the tradeoff between the empirical risk term and the data-dependent penalty. The techniques based on uniform conv...

26 | Algorithmic luckiness
- Herbrich, Williamson
- 2003
Citation Context: ...generated by learning algorithms satisfying certain properties. Examples along these lines are the notions of self-bounding learners [16], [17], algorithmic stability [18], and algorithmic luckiness [19]. In this paper we follow a similar idea and develop a general framework for analyzing the risk of hypotheses generated by on-line learners, a specific class of learning algorithms. Exploiting certain...

24 | An improved predictive accuracy bound for averaging classifiers
- Langford, Seeger, et al.
- 2001
Citation Context: ...Prominent examples of this kind are the bounds for linear-threshold classifiers, where the statistic depends on the margin of h [11], [12], [13], [14], and the bounds for Bayesian mixtures, where it depends on the Kullback-Leibler divergence between the data-dependent mixture coefficients and the a priori coefficients [15]. Note that bounds of t...

23 | Self bounding learning algorithms
- Freund
- 1998
Citation Context: ...bounds yields statements which only hold for the hypotheses generated by learning algorithms satisfying certain properties. Examples along these lines are the notions of self-bounding learners [16], [17], algorithmic stability [18], and algorithmic luckiness [19]. In this paper we follow a similar idea and develop a general framework for analyzing the risk of hypotheses generated by on-line learners,...

17 | Relative expected instantaneous loss bounds
- Forster, Warmuth

17 | Competitive On-line Linear Regression
- Vovk
- 1997
Citation Context: ...where the clipping function satisfies Y(x) = -Y if x ≤ -Y, Y(x) = Y if x ≥ Y, and Y(x) = x if -Y ≤ x ≤ Y. The losses (1/2)(y_i - h_{i-1}(x_i))^2 are thus bounded by 2Y^2. We can apply Theorem 2 to the bound on the cumulative loss M for ridge regression (see [22, 2]) and obtain that, with probability at least 1 - δ with respect to the draw of the training sample Z_t, the risk er(H) of the average hypothesis estimator H is at most (1/t)((a/2)||u||_2^2 + M(u; Z_t)) + 2Y...

15 | Microchoice bounds and self bounding learning algorithms
- Langford, Blum
- 2003
Citation Context: ...risk bounds yields statements which only hold for the hypotheses generated by learning algorithms satisfying certain properties. Examples along these lines are the notions of self-bounding learners [16], [17], algorithmic stability [18], and algorithmic luckiness [19]. In this paper we follow a similar idea and develop a general framework for analyzing the risk of hypotheses generated by on-line lea...

14 | The complexity of learning according to two models of a drifting environment
- Long
- 1999
Citation Context: ...(1) holds (for a proof of this result see e.g. [5]). Uniform convergence implies that the class can be learned by the empirical risk minimizer, i.e., by the algorithm returning the hypothesis with smallest empirical risk. Once we have a uniform convergence result like (1), the risk a...

13 | Data-dependent margin-based generalization bounds for classification
- Antos, Kégl, et al.
- 2002
Citation Context: ...Prominent examples of this kind are the bounds for linear-threshold classifiers, where the statistic depends on the margin of h [11], [12], [13], [14], and the bounds for Bayesian mixtures, where it depends on the Kullback-Leibler divergence between the data-dependent mixture coefficients and the a priori coefficients [15]. Note that...

9 | On the generalization of soft margin algorithms
- Shawe-Taylor, Cristianini
Citation Context: ...(p - 1) D_γ(u; Z_t) + 5 sqrt((1/t) ln(2(t + 1))) (4) for any γ > 0 and for any u such that ||u||_{p/(p-1)} ≤ 1. The margin-based quantity D_γ(u; z_t) = sum_{i=1}^t max{0, 1 - y_i u · x_i / γ} is called soft margin in [20] and accounts for the distribution of margin values achieved by the examples in z_t with respect to hyperplane u. Traditional data-dependent bounds using uniform convergence methods (e.g., [19]) are t...