Results 1–10 of 25
Improved Boosting Algorithms Using Confidence-Rated Predictions
 Machine Learning
, 1999
"... We describe several improvements to Freund and Schapire’s AdaBoost boosting algorithm, particularly in a setting in which hypotheses may assign confidences to each of their predictions. We give a simplified analysis of AdaBoost in this setting, and we show how this analysis can be used to find impr ..."
Abstract

Cited by 698 (26 self)
We describe several improvements to Freund and Schapire's AdaBoost boosting algorithm, particularly in a setting in which hypotheses may assign confidences to each of their predictions. We give a simplified analysis of AdaBoost in this setting, and we show how this analysis can be used to find improved parameter settings as well as a refined criterion for training weak hypotheses. We give a specific method for assigning confidences to the predictions of decision trees, a method closely related to one used by Quinlan. This method also suggests a technique for growing decision trees which turns out to be identical to one proposed by Kearns and Mansour. We focus next on how to apply the new boosting algorithms to multiclass classification problems, particularly to the multi-label case in which each example may belong to more than one class. We give two boosting methods for this problem, plus a third method based on output coding. One of these leads to a new method for handling the single-label case which is simpler but as effective as techniques suggested by Freund and Schapire. Finally, we give some experimental results comparing a few of the algorithms discussed in this paper.
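In the confidence-rated setting the abstract describes, a weak hypothesis outputs a real value whose sign is the predicted label and whose magnitude is the confidence, and the distribution update becomes D_{t+1}(i) ∝ D_t(i)·exp(−α_t·y_i·h_t(x_i)). A minimal sketch of one such update round (all numbers are hypothetical, not taken from the paper):

```python
import math

def update_weights(d, margins, alpha):
    """Confidence-rated boosting update:
    D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h(x_i)) / Z_t."""
    d = [w * math.exp(-alpha * m) for w, m in zip(d, margins)]
    z = sum(d)  # normalization constant Z_t
    return [w / z for w in d]

# One hypothetical round: 4 examples, a weak hypothesis with confidence 0.5,
# and example 4 misclassified (negative margin y_i * h(x_i)).
d = [0.25, 0.25, 0.25, 0.25]          # initial distribution D_1
margins = [0.5, 0.5, 0.5, -0.5]       # y_i * h(x_i)
eps = sum(w for w, m in zip(d, margins) if m <= 0)  # weighted error = 0.25
alpha = 0.5 * math.log((1 - eps) / eps)             # = (1/2) ln 3
d = update_weights(d, margins, alpha)
print([round(w, 3) for w in d])       # → [0.211, 0.211, 0.211, 0.366]
```

The misclassified example's weight rises while the others fall, which is the mechanism that focuses later rounds on hard examples.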
Scale-sensitive Dimensions, Uniform Convergence, and Learnability
, 1997
"... Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distributionfree convergence property of means to expectations uniformly over classes of random variables. Classes of realvalued functions enjoy ..."
Abstract

Cited by 208 (1 self)
Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distribution-free convergence property of means to expectations uniformly over classes of random variables. Classes of real-valued functions enjoying such a property are also known as uniform Glivenko-Cantelli classes. In this paper we prove, through a generalization of Sauer's lemma that may be interesting in its own right, a new characterization of uniform Glivenko-Cantelli classes. Our characterization yields Dudley, Giné, and Zinn's previous characterization as a corollary. Furthermore, it is the first based on a simple combinatorial quantity generalizing the Vapnik-Chervonenkis dimension. We apply this result to obtain the weakest combinatorial condition known to imply PAC learnability in the statistical regression (or "agnostic") framework. Furthermore, we show a characterization of learnability in the probabilistic concept model, solving an open problem posed by Kearns and Schapire. These results show that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class.
The Sample Complexity of Pattern Classification With Neural Networks: The Size of the Weights is More Important Than the Size of the Network
, 1997
"... Sample complexity results from computational learning theory, when applied to neural network learning for pattern classification problems, suggest that for good generalization performance the number of training examples should grow at least linearly with the number of adjustable parameters in the ne ..."
Abstract

Cited by 177 (15 self)
Sample complexity results from computational learning theory, when applied to neural network learning for pattern classification problems, suggest that for good generalization performance the number of training examples should grow at least linearly with the number of adjustable parameters in the network. Results in this paper show that if a large neural network is used for a pattern classification problem and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. For example, consider a two-layer feedforward network of sigmoid units, in which the sum of the magnitudes of the weights associated with each unit is bounded by A and the input dimension is n. We show that the misclassification probability is no more than a certain error estimate (that is related to squared error on the training set) plus A³√((log n)/m) (ignoring log factors).
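The bound parameter A in this abstract is a per-unit sum of weight magnitudes rather than a weight count. A small sketch of how it would be computed, with hypothetical weights and sample sizes (the actual bound also carries log and constant factors the abstract omits):

```python
import math

# Hypothetical two-layer network: one list of incoming weights per unit.
unit_weights = [
    [0.3, -0.2, 0.1],    # hidden unit 1
    [-0.4, 0.25, 0.05],  # hidden unit 2
    [0.5, -0.5],         # output unit
]
# A bounds the sum of weight magnitudes at every unit.
A = max(sum(abs(w) for w in ws) for ws in unit_weights)

def bound_term(A, n, m):
    """The abstract's complexity term A^3 * sqrt(log(n) / m),
    for input dimension n and sample size m (hypothetical values)."""
    return A ** 3 * math.sqrt(math.log(n) / m)

print(A)                                          # → 1.0
print(round(bound_term(A, n=100, m=10000), 4))    # → 0.0215
```

Note that adding more units with tiny weights leaves A (and hence the bound term) essentially unchanged, which is the point of the title.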
Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension
, 1992
"... : Let V ` f0; 1g n have VapnikChervonenkis dimension d. Let M(k=n;V ) denote the cardinality of the largest W ` V such that any two distinct vectors in W differ on at least k indices. We show that M(k=n;V ) (cn=(k + d)) d for some constant c. This improves on the previous best result of ((cn ..."
Abstract

Cited by 93 (4 self)
Let V ⊆ {0,1}^n have Vapnik-Chervonenkis dimension d. Let M(k/n, V) denote the cardinality of the largest W ⊆ V such that any two distinct vectors in W differ on at least k indices. We show that M(k/n, V) ≤ (cn/(k + d))^d for some constant c. This improves on the previous best result of ((cn/k) log(n/k))^d. This new result has applications in the theory of empirical processes. (The author gratefully acknowledges the support of the Mathematical Sciences Research Institute at UC Berkeley and ONR grant N00014-91-J-1162.) 1 Statement of Results. Let n be a natural number greater than zero. Let V ⊆ {0,1}^n. For a sequence of indices I = (i_1, ..., i_k), with 1 ≤ i_j ≤ n, let V|_I denote the projection of V onto I, i.e. V|_I = {(v_{i_1}, ..., v_{i_k}) : (v_1, ..., v_n) ∈ V}. If V|_I = {0,1}^k then we say that V shatters the index sequence I. The Vapnik-Chervonenkis dimension of V is the size of the longest index sequence I that is shattered by V [VC71] (t...
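The shattering definition above is directly checkable by brute force for small n. A sketch (not from the paper; the toy class V is a hypothetical example):

```python
from itertools import combinations, product

def shatters(V, I):
    """V shatters index sequence I iff the projection of V onto I
    equals all of {0,1}^|I|, as in the definition above."""
    proj = {tuple(v[i] for i in I) for v in V}
    return len(proj) == 2 ** len(I)

def vc_dimension(V, n):
    """Brute-force VC dimension of V ⊆ {0,1}^n: the largest shattered |I|."""
    d = 0
    for k in range(1, n + 1):
        if any(shatters(V, I) for I in combinations(range(n), k)):
            d = k
    return d

# Toy class: all vectors in {0,1}^4 with at most one 1.
V = [v for v in product([0, 1], repeat=4) if sum(v) <= 1]
print(vc_dimension(V, 4))   # → 1 (no pair of indices is shattered)
```

The brute force is exponential in n, of course; the point of results like the one above is precisely to reason about such classes combinatorially instead.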
Combining Discriminant Models with New Multi-Class SVMs
, 2000
"... The idea of combining models instead of simply selecting the best one, in order to improve performance, is well known in statistics and has a long theoretical background. However, making full use of theoretical results is ordinarily subject to the satisfaction of strong hypotheses (weak correlati ..."
Abstract

Cited by 39 (10 self)
The idea of combining models instead of simply selecting the best one, in order to improve performance, is well known in statistics and has a long theoretical background. However, making full use of theoretical results is ordinarily subject to the satisfaction of strong hypotheses (weak correlation among the errors, availability of large training sets, possibility to rerun the training procedure an arbitrary number of times, etc.). In contrast, the practitioner who has to make a decision is frequently faced with the difficult problem of combining a given set of pre-trained classifiers, with highly correlated errors, using only a small training sample. Overfitting is then the main risk, which can only be overcome by strict complexity control of the combiner selected. This suggests that SVMs, which implement the SRM inductive principle, should be well suited for these difficult situations. Investigating this idea, we introduce a new family of multi-class SVMs and assess them as ensemble methods on a real-world problem. This task, protein secondary structure prediction, is an open problem in biocomputing for which model combination appears to be an issue of central importance. Experimental evidence highlights the gain in quality resulting from combining some of the most widely used prediction methods with our SVMs rather than with the ensemble methods traditionally used in the field. The gain is increased when the outputs of the combiners are post-processed with a simple DP algorithm.
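The data-flow the abstract assumes (a combiner trained only on the outputs of fixed, pre-trained classifiers) can be sketched in a few lines. The base models and inputs below are hypothetical stand-ins, and the SVM combiner itself is not reimplemented here; the sketch only shows how the combiner's low-dimensional feature matrix is built, which is what keeps its capacity controllable on a small sample:

```python
# Hypothetical pre-trained base classifiers: each maps an input to a score in [0, 1].
base_models = [
    lambda x: 1.0 if x > 0.5 else 0.0,   # stand-in for a hard-threshold predictor
    lambda x: min(1.0, max(0.0, x)),     # stand-in for a soft-scoring predictor
]

def combiner_features(xs):
    """Stacking: the combiner sees only the base models' outputs, so its
    input dimension equals the number of base models, not the original
    problem dimension."""
    return [[m(x) for m in base_models] for x in xs]

# Small combiner training sample (hypothetical), as the abstract assumes.
xs = [0.1, 0.4, 0.6, 0.9]
print(combiner_features(xs))   # → [[0.0, 0.1], [0.0, 0.4], [1.0, 0.6], [1.0, 0.9]]
```

Any combiner with strict complexity control (the paper's multi-class SVMs, for instance) would then be fit on this small feature matrix.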
Combining protein secondary structure prediction models with ensemble methods of optimal complexity
, 2004
"... ..."
Characterizations of learnability for classes of {0,..., n}-valued functions
, 1997
"... We investigate the PAC learnability of classes of {0,..., n}valued functions (n <1). For n = 1 it is known that the niteness of the VapnikChervonenkis dimension is necessary and sufficient for learning. For n>1 several generalizations of the VCdimension, each yielding a distinct characterization ..."
Abstract

Cited by 17 (4 self)
We investigate the PAC learnability of classes of {0,..., n}-valued functions (n < ∞). For n = 1 it is known that the finiteness of the Vapnik-Chervonenkis dimension is necessary and sufficient for learning. For n > 1 several generalizations of the VC-dimension, each yielding a distinct characterization of learnability, have been proposed by a number of researchers. In this paper we present a general scheme for extending the VC-dimension to the case n > 1. Our scheme defines a wide variety of notions of dimension in which all these variants of the VC-dimension, previously introduced in the context of learning, appear as special cases. Our main result is a simple condition characterizing the set of notions of dimension whose finiteness is necessary and sufficient for learning. This provides a variety of new tools for determining the learnability of a class of multi-valued functions. Our characterization is also shown to hold in the "robust" variant of the PAC model and for any "reasonable" loss function.
Valid generalisation from approximate interpolation
 Combinatorics, Probability and Computing
, 1994
"... Let H and C be sets of functions from domain X to ℜ. We say that H validly generalises C from approximate interpolation if and only if for each η> 0 and ɛ, δ ∈ (0, 1) there is m0(η, ɛ, δ) such that for any function t ∈ C and any probability distribution P on X, if m ≥ m0 then with P mprobability at ..."
Abstract

Cited by 7 (6 self)
Let H and C be sets of functions from domain X to ℜ. We say that H validly generalises C from approximate interpolation if and only if for each η > 0 and ɛ, δ ∈ (0, 1) there is m0(η, ɛ, δ) such that for any function t ∈ C and any probability distribution P on X, if m ≥ m0 then with P^m-probability at least 1 − δ, a sample x = (x1, x2, ..., xm) ∈ X^m satisfies: ∀h ∈ H, |h(xi) − t(xi)| < η (1 ≤ i ≤ m) ⟹ P({x : |h(x) − t(x)| ≥ η}) < ɛ. We find conditions that are necessary and sufficient for H to validly generalise C from approximate interpolation, and we obtain bounds on the sample length m0(η, ɛ, δ) in terms of various parameters describing the expressive power of H.
A Theory for MemoryBased Learning
 Machine Learning
, 1994
"... A memorybased learning system is an extended memory management system that decomposes the input space either statically or dynamically into subregions for the purpose of storing and retrieving functional information. The main generalization techniques employed by memorybased learning systems are t ..."
Abstract

Cited by 7 (1 self)
A memory-based learning system is an extended memory management system that decomposes the input space either statically or dynamically into subregions for the purpose of storing and retrieving functional information. The main generalization techniques employed by memory-based learning systems are nearest-neighbor search, space decomposition techniques, and clustering. Research on memory-based learning is still in its early stage. In particular, there are very few rigorous theoretical results regarding memory requirement, sample size, expected performance, and computational complexity. In this paper, we propose a model for memory-based learning and use it to analyze several methods (ε-covering, hashing, clustering, tree-structured clustering, and receptive fields) for learning smooth functions. The sample size and system complexity are derived for each method. Our model is built upon the generalized PAC learning model of Haussler (Haussler, 1989) and is closely related to the method of vector quantization in data compression. Our main result is that we can build memory-based learning systems using new clustering storage in typical situations.
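The basic store-and-retrieve step underlying the systems this abstract describes can be sketched as a nearest-neighbor lookup over stored samples of a smooth function. The target function and query point below are hypothetical examples, not from the paper:

```python
def nn_predict(memory, x):
    """Nearest-neighbor retrieval: answer a query with the stored value of
    the closest key, the basic memory-based generalization step."""
    key = min(memory, key=lambda k: abs(k - x))
    return memory[key]

# Memory stores samples of a smooth target function (here f(x) = x^2).
memory = {x / 10: (x / 10) ** 2 for x in range(11)}
print(round(nn_predict(memory, 0.33), 2))   # → 0.09 (nearest stored key is 0.3)
```

For a smooth target, the retrieval error is governed by how finely the stored keys cover the input space, which is exactly the storage/accuracy trade-off the paper's analyses (ε-covering, clustering, etc.) quantify.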