Query by Committee
, 1992
Cited by 243 (2 self)
We propose an algorithm called query by committee, in which a committee of students is trained on the same data set. The next query is chosen according to the principle of maximal disagreement. The algorithm is studied for two toy models: the high-low game and perceptron learning of another perceptron. As the number of queries goes to infinity, the committee algorithm yields asymptotically finite information gain. This leads to generalization error that decreases exponentially with the number of examples. This is in marked contrast to learning from randomly chosen inputs, for which the information gain approaches zero and the generalization error decreases with a relatively slow inverse power law. We suggest that asymptotically finite information gain may be an important characteristic of good query algorithms.
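The committee idea can be illustrated on the high-low game with a minimal sketch. It assumes a hidden threshold drawn uniformly from [0, 1], a two-member committee whose students are sampled uniformly from the thresholds still consistent with the data, and a query placed at a point where the two students disagree. The function names (`label`, `update`, `committee_query`, `run`) are hypothetical and chosen only for this illustration; this is not the authors' code.

```python
import random


def label(x, target):
    """High-low game oracle: 1 if x lies above the hidden threshold, else 0."""
    return 1 if x > target else 0


def update(lo, hi, x, y):
    """Shrink the interval of thresholds consistent with the new example (x, y)."""
    if y == 1:                # x > target, so the threshold lies below x
        return lo, min(hi, x)
    return max(lo, x), hi     # x <= target, so the threshold lies at or above x


def committee_query(lo, hi, rng):
    """Pick a query on which two students sampled from the version space disagree."""
    w1, w2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
    a, b = min(w1, w2), max(w1, w2)
    # Any x strictly between the two thresholds is labeled 1 by the student
    # with threshold a and 0 by the student with threshold b.
    return rng.uniform(a, b)


def run(n_queries, use_committee, seed=0):
    """Return (width, lo, hi, target) after n_queries labeled examples."""
    rng = random.Random(seed)
    target = rng.random()
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = committee_query(lo, hi, rng) if use_committee else rng.random()
        lo, hi = update(lo, hi, x, label(x, target))
    return hi - lo, lo, hi, target


if __name__ == "__main__":
    for n in (10, 20, 40):
        w_committee = run(n, use_committee=True)[0]
        w_random = run(n, use_committee=False)[0]
        print(f"n={n:3d}  committee width={w_committee:.3e}  random width={w_random:.3e}")
```

The width of the remaining interval serves here as a rough proxy for generalization error. Under these assumptions, committee queries always fall inside the current version space, so each answer removes part of it, whereas a randomly chosen input often lands outside it and removes nothing.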
Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
- Machine Learning
, 1994
Cited by 98 (12 self)
In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models. 1 Introduction Consider a simple concept learning model in which the learner attempts to infer an unknown target concept f, chosen from a known concept class F of {0, 1}-valued functions over an instance space X....
Rigorous learning curve bounds from statistical mechanics
- Machine Learning
, 1994
Cited by 52 (9 self)
In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the well-established Vapnik-Chervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior (functional form) of learning curves. This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory. The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes. We illustrate our results with many concrete examples of learning curve bounds derived from our theory. 1 Introduction According to the Vapnik-Chervonenkis (VC) theory of learning curves [27, 26], minimizing empirical error within a function class F on a random sample of m examples leads to generalization error bounded by Õ(d/m) (in the case that the target function is contained in F) or Õ(√(d/m)) plus the optimal generalization error achievable within F (in the general case). These bounds are universal: they hold for any class of hypothesis functions F, for any input distribution, and for any target function. The only problem-specific quantity remaining in these bounds is the VC dimension d, a measure of the complexity of the function class F. It has been shown that these bounds are essentially the best distribution-independent bounds possible, in the sense that for any function class, there exists an input distribution for which matching lower bounds on the generalization error can be given [5, 7, 22].
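In display form, the two VC bounds quoted in this abstract can be written as below. The symbols $\epsilon(m)$ (generalization error after $m$ examples) and $\epsilon_{\min}$ (best error achievable within $F$) are notation assumed here for illustration and do not appear in the excerpt; $\tilde{O}$ hides logarithmic factors.

```latex
\epsilon(m) \;\le\; \tilde{O}\!\left(\frac{d}{m}\right)
\quad \text{(target function in } F\text{)},
\qquad
\epsilon(m) \;\le\; \epsilon_{\min} + \tilde{O}\!\left(\sqrt{\frac{d}{m}}\right)
\quad \text{(general case)}.
```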
On Weak Learning
- Journal of Computer and System Sciences
, 1995
Cited by 49 (9 self)
This paper presents relationships between weak learning, weak prediction (where the probability of being correct is slightly larger than 50%), and consistency oracles (which decide whether or not a given set of examples is consistent with a concept in the class). Our main result is a simple polynomial prediction algorithm which makes only a single query to a consistency oracle and whose predictions have a polynomial edge over random guessing. We compare this prediction algorithm with several of the standard prediction techniques, deriving an improved worst-case bound on the Gibbs algorithm in the process. We use our algorithm to show that a concept class is polynomially learnable if and only if there is a polynomial probabilistic consistency oracle for the class. Since strong learning algorithms can be built from weak learning algorithms, our results also characterize strong learnability.
How Well do Bayes Methods Work for On-Line Prediction of {±1} values?
- In Proceedings of the Third NEC Symposium on Computation and Cognition. SIAM
, 1992
Cited by 17 (10 self)
We look at sequential classification and regression problems in which {±1}-labeled instances are given on-line, one at a time, and for each new instance, before seeing the label, the learning system must either predict the label, or estimate the probability that the label is +1. We look at the performance of Bayes method for this task, as measured by the total number of mistakes for the classification problem, and by the total log loss (or information gain) for the regression problem. Our results are given by comparing the performance of Bayes method to the performance of a hypothetical "omniscient scientist" who is able to use extra information about the labeling process that would not be available in the standard learning protocol. The results show that Bayes methods perform only slightly worse than the omniscient scientist in many cases. These results generalize previous results of Haussler, Kearns and Schapire, and Opper and Haussler. 1 Introduction Several recent papers in...
Calculation of the Learning Curve of Bayes Optimal Classification Algorithm for Learning a Perceptron With Noise
- In Computational Learning Theory: Proceedings of the Fourth Annual Workshop
, 1991
Cited by 15 (6 self)
The learning curve of Bayes optimal classification algorithm when learning a perceptron from noisy random training examples is calculated exactly in the limit of large training sample size and large instance space dimension using methods of statistical mechanics. It is shown that under certain assumptions, in this "thermodynamic" limit, the probability of misclassification of Bayes optimal algorithm is less than that of a canonical stochastic learning algorithm, by a factor approaching √2 as the ratio of number of training examples to instance space dimension grows. Exact asymptotic learning curves for both algorithms are derived for particular distributions. In addition, it is shown that the learning performance of Bayes optimal algorithm can be approximated by certain learning algorithms that use a neural net with a layer of hidden units to learn a perceptron. 1 Introduction Extending a line of research initiated by Elizabeth Gardner [Gar88, GD88], exceptional progress has been ...
On the Stochastic Complexity of Learning Realizable and Unrealizable Rules
, 1995
Cited by 10 (1 self)
The problem of learning from examples in an average case setting is considered. Focusing on the stochastic complexity, an information theoretic quantity measuring the minimal description length of the data given a class of models, we find rigorous upper and lower bounds for this quantity under various conditions. For realizable problems, where the model class used is sufficiently rich to represent the function giving rise to the examples, we find tight upper and lower bounds for the stochastic complexity. In this case, bounds on the prediction error follow immediately using the methods of Haussler et al. (1994a). For unrealizable learning we find a tight upper bound only in the case of learning within a space of finite VC dimension. Moreover, we show in the latter case that the optimal method for prediction may not be the same as that for data compression, even in the limit of an infinite amount of training data, although the two problems (i.e. prediction and compression) are asymptoti...
Annealed Theories of Learning
- In J.-H
, 1995
Cited by 9 (1 self)
We study annealed theories of learning boolean functions using a concept class of finite cardinality. The naive annealed theory can be used to derive a universal learning curve bound for zero temperature learning, similar to the inverse square root bound from the Vapnik-Chervonenkis theory. Tighter, nonuniversal learning curve bounds are also derived. A more refined annealed theory leads to still tighter bounds, which in some cases are very similar to results previously obtained using one-step replica symmetry breaking. 1. Introduction The annealed approximation 1 has proven to be an invaluable tool for studying the statistical mechanics of learning from examples. Previously it was found that the annealed approximation gave qualitatively correct results for several models of perceptrons learning realizable rules. 2 Because of its simplicity relative to the full quenched theory, the annealed approximation has since been used in studies of more complicated multilayer architectures. ...
Part 1: Overview of the Probably Approximately Correct (PAC) Learning Framework
, 1995
Cited by 3 (0 self)
Here we survey some recent theoretical results on the efficiency of machine learning algorithms. The main tool described is the notion of Probably Approximately Correct (PAC) learning, introduced by Valiant. We define this learning model and then look at some of the results obtained in it. We then consider some criticisms of the PAC model and the extensions proposed to address these criticisms. Finally, we look briefly at other models recently proposed in computational learning theory.
Faithful Representations and Moments of Satisfaction: Probabilistic Methods in Learning and Logic
, 1998
Cited by 2 (0 self)
To my wife, Ma'ayan, and my daughter, Shira. Acknowledgments. Special thanks are due to: Prof. Naftali Tishby for his help and guidance in carrying out this study, for the many fascinating discussions we had, and for the immense body of knowledge that I have absorbed from him during my studies.

