Boosting a Weak Learning Algorithm By Majority
1995
Cited by 358 (15 self)
We present an algorithm for improving the accuracy of algorithms for learning binary concepts. The improvement is achieved by combining a large number of hypotheses, each of which is generated by training the given learning algorithm on a different set of examples. Our algorithm is based on ideas presented by Schapire in his paper "The strength of weak learnability", and represents an improvement over his results. The analysis of our algorithm provides general upper bounds on the resources required for learning in Valiant's polynomial PAC learning framework, which are the best general upper bounds known today. We show that the number of hypotheses that are combined by our algorithm is the smallest number possible. Other outcomes of our analysis are results regarding the representational power of threshold circuits, the relation between learnability and compression, and a method for parallelizing PAC learning algorithms. We provide extensions of our algorithms to cases in which the conc...
How to Use Expert Advice
Journal of the Association for Computing Machinery, 1997
Cited by 267 (60 self)
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
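To make the setting concrete, here is a minimal sketch of combining expert predictions by a weighted vote. It is the classic deterministic weighted-majority rule, used here only as an illustration; it is not the randomized algorithm analyzed in the paper, and the function name, the penalty factor `beta`, and the data layout are all assumptions.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Predict each bit by a weighted vote of the experts.

    expert_predictions: list of rounds, each a list of 0/1 predictions
        (one entry per expert).
    outcomes: list of the true bits, one per round.
    beta: factor in (0, 1) applied to the weight of every expert that errs
        (an assumed value, not taken from the paper).
    Returns (mistakes made by the vote, mistakes made by each expert).
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n
    expert_mistakes = [0] * n
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        # Total weight of the experts voting for 1 in this round.
        weight_for_one = sum(w for w, p in zip(weights, preds) if p == 1)
        guess = 1 if weight_for_one >= sum(weights) / 2 else 0
        if guess != y:
            mistakes += 1
        # Shrink the weight of every expert that was wrong.
        for i, p in enumerate(preds):
            if p != y:
                weights[i] *= beta
                expert_mistakes[i] += 1
    return mistakes, expert_mistakes


if __name__ == "__main__":
    outcomes = [1, 0, 1, 0, 0, 1]
    # Expert 0 is always right, expert 1 always wrong, expert 2 always says 1.
    rounds = [[y, 1 - y, 1] for y in outcomes]
    print(weighted_majority(rounds, outcomes))
```

The quantity the abstract bounds is the gap between the combined predictor's mistakes and those of the best expert, which in this sketch is `mistakes - min(expert_mistakes)`.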
Learning Conjunctions of Horn Clauses
Machine Learning, 1992
Cited by 101 (14 self)
An algorithm is presented for learning the class of Boolean formulas that are expressible as conjunctions of Horn clauses. (A Horn clause is a disjunction of literals, all but at most one of which is a negated variable.) The algorithm uses equivalence queries and membership queries to produce a formula that is logically equivalent to the unknown formula to be learned. The amount of time used by the algorithm is polynomial in the number of variables and the number of clauses in the unknown formula.
Keywords: propositional Horn sentences, equivalence queries, membership queries, exact identification, polynomial time learning
1 The Problem
Valiant (1984) introduced the distribution-free or "PAC" criterion for concept learning and focused attention on the question of what classes of Boolean formulas can be learned in polynomial time with respect to this criterion. He gave a polynomial-time algorithm to learn k-CNF or k-DNF formulas and...
Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
Machine Learning, 1994
Cited by 98 (12 self)
In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models.
1 Introduction
Consider a simple concept learning model in which the learner attempts to infer an unknown target concept $f$, chosen from a known concept class $F$ of $\{0, 1\}$-valued functions over an instance space $X$....
Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension
1992
Cited by 84 (4 self)
Let $V \subseteq \{0,1\}^n$ have Vapnik-Chervonenkis dimension $d$. Let $M(k/n, V)$ denote the cardinality of the largest $W \subseteq V$ such that any two distinct vectors in $W$ differ on at least $k$ indices. We show that $M(k/n, V) \le (cn/(k + d))^d$ for some constant $c$. This improves on the previous best result of $((cn/k) \log(n/k))^d$. This new result has applications in the theory of empirical processes.
The author gratefully acknowledges the support of the Mathematical Sciences Research Institute at UC Berkeley and ONR grant N00014-91-J-1162.
1 Statement of Results
Let $n$ be a natural number greater than zero. Let $V \subseteq \{0,1\}^n$. For a sequence of indices $I = (i_1, \ldots, i_k)$, with $1 \le i_j \le n$, let $V|_I$ denote the projection of $V$ onto $I$, i.e. $V|_I = \{(v_{i_1}, \ldots, v_{i_k}) : (v_1, \ldots, v_n) \in V\}$. If $V|_I = \{0,1\}^k$ then we say that $V$ shatters the index sequence $I$. The Vapnik-Chervonenkis dimension of $V$ is the size of the longest index sequence $I$ that is shattered by $V$ [VC71] (t...
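The definitions of projection and shattering above can be checked by brute force on small examples. The following is a rough sketch under assumptions: the function names are hypothetical, vectors are represented as tuples of 0/1 values, and index sets with distinct entries stand in for index sequences. The search is exponential in $n$, so it only suits tiny instances.

```python
from itertools import combinations


def shatters(vectors, indices):
    """Return True if projecting the vectors onto `indices` gives all of {0,1}^k."""
    projections = {tuple(v[i] for i in indices) for v in vectors}
    return len(projections) == 2 ** len(indices)


def vc_dimension(vectors):
    """Size of the largest index set shattered by a nonempty set of 0/1 vectors."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("expected a nonempty set of vectors")
    n = len(vectors[0])
    dimension = 0
    for k in range(1, n + 1):
        if any(shatters(vectors, idx) for idx in combinations(range(n), k)):
            dimension = k
        else:
            # Every subset of a shattered set is shattered, so if no set of
            # size k is shattered, no larger set can be either.
            break
    return dimension


if __name__ == "__main__":
    # Vectors in {0,1}^3 with at most one coordinate equal to 1.
    at_most_one = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
    print(vc_dimension(at_most_one))
```

In the example, no pair of indices is shattered because the pattern (1, 1) never appears, while every single index is.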
On-line portfolio selection using multiplicative updates
Mathematical Finance, 1998
Cited by 67 (10 self)
We present an on-line investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio determined in hindsight from the actual market outcomes. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth. Our algorithm is very simple to implement and requires only constant storage and computing time per stock in each trading period. We tested the performance of our algorithm on real stock data from the New York Stock Exchange accumulated during a 22-year period. On this data, our algorithm clearly outperforms the best single stock as well as Cover's universal portfolio selection algorithm. We also present results for the situation in which the ...
We present an on-line investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio investment strategy. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth [20]. Our algorithm is very simple to implement and its time and storage requirements grow linearly in the number of stocks.
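The abstract does not spell out the update rule, so the following is only a sketch of one multiplicative update in the exponentiated-gradient style associated with the Kivinen and Warmuth framework: each weight is multiplied by an exponential of that stock's price relative divided by the portfolio's return, and the weights are then renormalized. The function names, the learning rate `eta`, and the toy data are assumptions for illustration.

```python
import math


def eg_update(weights, price_relatives, eta=0.05):
    """One multiplicative update of the portfolio weights.

    weights: current fractions of wealth per stock (nonnegative, summing to 1).
    price_relatives: closing price divided by opening price, per stock.
    eta: learning rate (an assumed value).
    """
    portfolio_return = sum(w * x for w, x in zip(weights, price_relatives))
    unnormalized = [
        w * math.exp(eta * x / portfolio_return)
        for w, x in zip(weights, price_relatives)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def run_portfolio(price_relatives_by_period, eta=0.05):
    """Rebalance to the current weights each period; return (wealth, weights)."""
    n = len(price_relatives_by_period[0])
    weights = [1.0 / n] * n  # start from the uniform portfolio
    wealth = 1.0
    for x in price_relatives_by_period:
        wealth *= sum(w * xi for w, xi in zip(weights, x))
        weights = eg_update(weights, x, eta)
    return wealth, weights


if __name__ == "__main__":
    # Two hypothetical stocks over three trading periods.
    periods = [[1.1, 0.9], [1.05, 1.0], [0.98, 1.02]]
    print(run_portfolio(periods))
```

Each update touches every stock once, which matches the abstract's claim of constant time and storage per stock in each trading period.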
Sample compression, learnability, and the Vapnik-Chervonenkis dimension
Machine Learning, 1995
Cited by 55 (3 self)
Within the framework of pac-learning, we explore the learnability of concepts from samples using the paradigm of sample compression schemes. A sample compression scheme of size $k$ for a concept class $C \subseteq 2^X$ consists of a compression function and a reconstruction function. The compression function receives a finite sample set consistent with some concept in $C$ and chooses a subset of $k$ examples as the compression set. The reconstruction function forms a hypothesis on $X$ from a compression set of $k$ examples. For any sample set of a concept in $C$ the compression set produced by the compression function must lead to a hypothesis consistent with the whole original sample set when it is fed to the reconstruction function. We demonstrate that the existence of a sample compression scheme of fixed size for a class $C$ is sufficient to ensure that the class $C$ is pac-learnable. Previous work has shown that a class is pac-learnable if and only if the Vapnik-Chervonenkis (VC) dimension of the class i...
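As a toy illustration of the compression and reconstruction functions described above, consider the class of rays on the real line, where a point is labeled 1 exactly when it is at least some unknown threshold. This class is chosen here for simplicity and is not taken from the paper; the sketch keeps at most one example (the smallest positive one), and the function names are hypothetical.

```python
def compress(sample):
    """Keep the smallest positively labeled example, or nothing if none exists.

    sample: list of (x, label) pairs consistent with some ray {x >= threshold}.
    """
    positives = [x for x, label in sample if label == 1]
    return [(min(positives), 1)] if positives else []


def reconstruct(compression_set):
    """Build a hypothesis on the real line from the compression set."""
    if not compression_set:
        return lambda x: 0  # no positive example seen: label everything 0
    threshold = compression_set[0][0]
    return lambda x: 1 if x >= threshold else 0


if __name__ == "__main__":
    sample = [(0.2, 0), (0.9, 1), (0.5, 1), (0.4, 0)]
    kept = compress(sample)
    hypothesis = reconstruct(kept)
    print(kept, all(hypothesis(x) == y for x, y in sample))
```

The reconstructed hypothesis agrees with the whole sample because every negative example lies below the true threshold, which in turn is at most the smallest positive example.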
Tracking drifting concepts by minimizing disagreements
Machine Learning, 1994
Cited by 55 (3 self)
In this paper we consider the problem of tracking a subset of a domain (called the target) which changes gradually over time. A single (unknown) probability distribution over the domain is used to generate random examples for the learning algorithm and measure the speed at which the target changes. Clearly, the more rapidly the target moves, the harder it is for the algorithm to maintain a good approximation of the target. Therefore we evaluate algorithms based on how much movement of the target can be tolerated between examples while predicting with accuracy $\epsilon$. Furthermore, the complexity of the class $\mathcal{H}$ of possible targets, as measured by $d$, its VC-dimension, also affects the difficulty of tracking the target concept. We show that if the problem of minimizing the number of disagreements with a sample from among concepts in a class $\mathcal{H}$ can be approximated to within a factor $k$, then there is a simple tracking algorithm for $\mathcal{H}$ which can achieve a probability $\epsilon$ of making a mistake if the target movement rate is at most a constant times $\epsilon^2/(k(d + k) \ln(1/\epsilon))$, where $d$ is the Vapnik-Chervonenkis dimension of $\mathcal{H}$. Also, we show that if $\mathcal{H}$ is properly PAC-learnable, then there is an efficient (randomized) algorithm that with high probability approximately minimizes disagreements to within a factor of $7d + 1$, yielding an efficient tracking algorithm for $\mathcal{H}$ which tolerates drift rates up to a constant times $\epsilon^2/(d^2 \ln(1/\epsilon))$. In addition, we prove complementary results for the classes of halfspaces and axis-aligned hyperrectangles showing that the maximum rate of drift that any algorithm (even with unlimited computational power) can tolerate is a constant times $\epsilon^2/d$.
Efficient Learning of Typical Finite Automata from Random Walks
1997
Cited by 44 (9 self)
This paper describes new and efficient algorithms for learning deterministic finite automata. Our approach is primarily distinguished by two features: (1) the adoption of an average-case setting to model the "typical" labeling of a finite automaton, while retaining a worst-case model for the underlying graph of the automaton, along with (2) a learning model in which the learner is not provided with the means to experiment with the machine, but rather must learn solely by observing the automaton's output behavior on a random input sequence. The main contribution of this paper is in presenting the first efficient algorithms for learning nontrivial classes of automata in an entirely passive learning model. We adopt an on-line learning model in which the learner is asked to predict the output of the next state, given the next symbol of the random input sequence; the goal of the learner is to make as few prediction mistakes as possible. Assuming the learner has a means of resetting the target machine to a fixed start state, we first present an efficient algorithm that
Approximating Hyper-Rectangles: Learning and Pseudo-random Sets
Journal of Computer and System Sciences, 1997
Cited by 36 (3 self)
The PAC learning of rectangles has been studied because they have been found experimentally to yield excellent hypotheses for several applied learning problems. Also, pseudorandom sets for rectangles have been actively studied recently because (i) they are a subproblem common to the derandomization of depth-2 (DNF) circuits and derandomizing Randomized Logspace, and (ii) they approximate the distribution of n independent multivalued random variables. We present improved upper bounds for a class of such problems of "approximating" high-dimensional rectangles that arise in PAC learning and pseudorandomness.
Key words and phrases. Rectangles, machine learning, PAC learning, derandomization, pseudorandomness, multiple-instance learning, explicit constructions, Ramsey graphs, random graphs, sample complexity, approximations of distributions.
1 Introduction
A basic common theme of a large part of PAC learning and derandomization/computational pseudorandomness is to "approximate" a stru...

