Efficient noise-tolerant learning from statistical queries
- JOURNAL OF THE ACM
, 1998
Cited by 248 (6 self)
In this paper, we study the problem of learning in the presence of classification noise in the probabilistic learning model of Valiant and its variants. In order to identify the class of “robust” learning algorithms in the most general way, we formalize a new but related model of learning from statistical queries. Intuitively, in this model, a learning algorithm is forbidden to examine individual examples of the unknown target function, but is given access to an oracle providing estimates of probabilities over the sample space of random examples. One of our main results shows that any class of functions learnable from statistical queries is in fact learnable with classification noise in Valiant’s model, with a noise rate approaching the information-theoretic barrier of 1/2. We then demonstrate the generality of the statistical query model, showing that practically every class learnable in Valiant’s model and its variants can also be learned in the new model (and thus can be learned in the presence of noise). A notable exception to this statement is the class of parity functions, which we prove is not learnable from statistical queries, and for which no noise-tolerant algorithm is known.
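The oracle described in this abstract can be illustrated with a minimal sketch. It is not the paper's construction: the names sq_oracle, chi, draw_example and the sample-size rule are assumptions made for illustration. The learner submits a predicate over (example, label) pairs and a tolerance, and receives only an estimated probability, never the individual examples.

```python
import random

def sq_oracle(chi, target, draw_example, tolerance, rng):
    """Estimate P[chi(x, target(x)) is true] from random examples.

    The sample size is a loose Hoeffding-style choice (an assumption of this
    sketch), so the estimate is within `tolerance` with high probability.
    """
    n = int(4.0 / tolerance ** 2) + 1
    hits = 0
    for _ in range(n):
        x = draw_example(rng)
        if chi(x, target(x)):
            hits += 1
    return hits / n

rng = random.Random(0)
draw = lambda r: tuple(r.randint(0, 1) for _ in range(5))  # uniform 5-bit examples
target = lambda x: x[0] & x[2]                             # hypothetical target: a conjunction

# Query: how often is bit 0 set while the label is 1? (true value is 0.25)
estimate = sq_oracle(lambda x, y: x[0] == 1 and y == 1, target, draw, 0.05, rng)
```

Because the learner sees only such averages, random label flips perturb each answer in a predictable way, which is the intuition behind the noise-tolerance result stated above.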
A Polynomial-time Algorithm for Learning Noisy Linear Threshold Functions
, 1996
Cited by 51 (11 self)
In this paper we consider the problem of learning a linear threshold function (a halfspace in n dimensions, also called a "perceptron"). Methods for solving this problem generally fall into two categories. In the absence of noise, this problem can be formulated as a Linear Program and solved in polynomial time with the Ellipsoid Algorithm or Interior Point methods. Alternatively, simple greedy algorithms such as the Perceptron Algorithm are often used in practice and have certain provable noise-tolerance properties; but their running time depends on a separation parameter, which quantifies the amount of "wiggle room" available for a solution, and can be exponential in the description length of the input. In this paper, we show how simple greedy methods can be used to find weak hypotheses (hypotheses that correctly classify noticeably more than half of the examples) in polynomial time, without dependence on any separation parameter. Suitably combining these hypotheses results in a polynomial-time algorithm for learning linear threshold functions in the PAC model in the presence of random classification noise. (This also gives a polynomial-time algorithm for learning linear threshold functions in the Statistical Query model of Kearns.) Our algorithm is based on a new method for removing outliers in data. Specifically, for any set S of points in R^n, each given to b bits of precision, we show that one can remove only a small fraction of S so that in the remaining set T, for every vector v, $\max_{x \in T} (v \cdot x)^2 \le \mathrm{poly}(n, b) \, \mathbb{E}_{x \in T}[(v \cdot x)^2]$; i.e., for any hyperplane through the origin, the maximum distance (squared) from a point in T to the plane is at most polynomially larger than the average. After removing these outliers, we are able to show that a modified v...
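The outlier condition in this abstract compares, for a direction v, the largest squared projection in the set with the average squared projection. The sketch below only measures that ratio for one given v. It does not implement the paper's outlier-removal procedure, and the function name and sample points are hypothetical.

```python
def max_to_mean_ratio(points, v):
    """Ratio of the largest squared projection onto v to the mean squared projection."""
    squares = [sum(vi * xi for vi, xi in zip(v, x)) ** 2 for x in points]
    mean = sum(squares) / len(squares)
    if mean == 0:
        # Every projection is zero, so the inequality holds trivially.
        return 1.0
    return max(squares) / mean

points = [(1.0, 0.0), (0.0, 1.0), (10.0, 0.0)]
v = (1.0, 0.0)

ratio_before = max_to_mean_ratio(points, v)      # the far point (10, 0) dominates
ratio_after = max_to_mean_ratio(points[:2], v)   # same direction with that point removed
```

A small ratio in every direction means no single point is much farther from the hyperplane than the typical point, which is the property the abstract says holds after removing a small fraction of the set.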
A neuroidal architecture for cognitive computation
- Journal of the ACM
, 2000
Cited by 32 (4 self)
An architecture is described for designing systems that acquire and manipulate large amounts of unsystematized, or so-called commonsense, knowledge. Its aim is to exploit to the full those aspects of computational learning that are known to offer powerful solutions in the acquisition and maintenance of robust knowledge bases. The architecture makes explicit the requirements on the basic computational tasks that are to be performed and is designed to make these computationally tractable even for very large databases. The main claims are that (i) the basic learning and deduction tasks are provably tractable and (ii) tractable learning offers viable approaches to a range of issues that have been previously identified as problematic for artificial intelligence systems that are programmed. Among the issues that learning offers to resolve are robustness to inconsistencies, robustness to incomplete information and resolving among alternatives. Attribute-efficient learning algorithms, which allow learning from few examples in large dimensional systems, are fundamental to the approach. Underpinning the overall architecture is a new principled approach to manipulating relations in learning systems. This approach, of independently quantified arguments, allows propositional learning algorithms to be applied systematically to learning relational concepts in polynomial time and in a modular fashion.
Robust Logics
Cited by 27 (6 self)
Suppose that we wish to learn from examples and counter-examples a criterion for recognizing whether an assembly of wooden blocks constitutes an arch. Suppose also that we have preprogrammed recognizers for various relationships e.g. on-top-of(x; y), above(x; y), etc. and believe that some possibly complex expression in terms of these base relationships should suffice to approximate the desired notion of an arch. How can we formulate such a relational learning problem so as to exploit the benefits that are demonstrably available in propositional learning, such as attribute-efficient learning by linear separators, and error-resilient learning? We believe that learning in a general setting that allows for multiple objects and relations in this way is a fundamental key to resolving the following dilemma that arises in the design of intelligent systems: Mathematical logic is an attractive language of description because it has clear semantics and sound proof procedures. However, as a basis for large programmed systems it leads to brittleness because, in practice, consistent usage of the various predicate names throughout a system cannot be guaranteed, except in application areas such as mathematics where the viability of the axiomatic method has been demonstrated independently. In this paper we develop the following approach to circumventing this dilemma. We suggest that brittleness can be overcome by using a new kind of logic in which each statement is learnable. By allowing the system to learn rules empirically from the environment, relative to any particular programs it may have for recognizing some base predicates, we enable the system to acquire a set of statements approximately consistent with each other and with the world, without the need for a globally knowledgeable and consistent programmer. We illustrate
Hardness of learning halfspaces with noise
- In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
, 2006
Cited by 23 (3 self)
Learning an unknown halfspace (also called a perceptron) from labeled examples is one of the classic problems in machine learning. In the noise-free case, when a halfspace consistent with all the training examples exists, the problem can be solved in polynomial time using linear programming. However, under the promise that a halfspace consistent with a fraction (1 − ε) of the examples exists (for some small constant ε > 0), it was not known how to efficiently find a halfspace that is correct on even 51% of the examples. Nor was a hardness result known that ruled out getting agreement on more than 99.9% of the examples. In this work, we close this gap in our understanding, and prove that even a tiny amount of worst-case noise makes the problem of learning halfspaces intractable in a strong sense. Specifically, for arbitrary ε, δ > 0, we prove that given a set of example-label pairs from the hypercube, a fraction (1 − ε) of which can be explained by a halfspace, it is NP-hard to find a halfspace that correctly labels a fraction (1/2 + δ) of the examples. The hardness result is tight since it is trivial to get agreement on 1/2 the examples. In learning theory parlance, we prove that weak proper agnostic learning of halfspaces is hard. This settles a question that was raised by Blum et al. in their work on learning halfspaces in the presence of random classification noise [10], and in some more recent works as well. Along the way, we also obtain a strong hardness result for another basic computational problem: solving a linear system over the rationals.
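The remark that agreement on half the examples is trivial can be made concrete with a small sketch. It assumes labels in {+1, -1} and treats a constant predictor as a degenerate halfspace (zero weights, with the threshold deciding the sign). The function name and the sample labels are hypothetical.

```python
def constant_halfspace_agreement(labels):
    """Best agreement achievable by always predicting +1 or always predicting -1."""
    positives = sum(1 for y in labels if y == 1)
    best = max(positives, len(labels) - positives)
    return best / len(labels)

labels = [1, -1, -1, 1, -1, -1, 1]
agreement = constant_halfspace_agreement(labels)  # majority label is -1 (4 of 7)
```

Whichever label is in the majority covers at least half of the examples, so the hardness result concerns any agreement noticeably above this baseline.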
A Simple Polynomial-time Rescaling Algorithm for Solving Linear Programs
- Proceedings of STOC’04
, 2004
Cited by 21 (4 self)
The perceptron algorithm, developed mainly in the machine learning literature, is a simple greedy method for finding a feasible solution to a linear program (alternatively, for learning a threshold function). In spite of its exponential worst-case complexity, it is often quite useful, in part due to its noise tolerance and also its overall simplicity. In this paper, we show that a randomized version of the perceptron algorithm with periodic rescaling runs in polynomial time. The resulting algorithm for linear programming has an elementary description and analysis.
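For orientation, here is a minimal sketch of the classic perceptron update used as a feasibility search for strict inequalities a . x > 0. It is the textbook greedy method, not the randomized rescaling variant this abstract announces. The function name, the update cap, and the sample rows are assumptions, and rows are assumed to be nonzero.

```python
def perceptron_feasible(rows, max_updates=10_000):
    """Search for x with dot(a, x) > 0 for every row a; return x, or None if the cap is hit."""
    x = [0.0] * len(rows[0])
    for _ in range(max_updates):
        # Find any violated constraint under the current x.
        violated = next(
            (a for a in rows if sum(ai * xi for ai, xi in zip(a, x)) <= 0),
            None,
        )
        if violated is None:
            return x
        # Greedy step: move x toward the normalized violated row.
        norm = sum(ai * ai for ai in violated) ** 0.5
        x = [xi + ai / norm for xi, ai in zip(x, violated)]
    return None

rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
solution = perceptron_feasible(rows)
```

The number of updates this loop needs grows as the feasible cone gets thinner, which is the exponential worst-case behavior the abstract refers to.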
On PAC Learning using Winnow, Perceptron, and a Perceptron-Like Algorithm
Cited by 18 (8 self)
In this paper we analyze the PAC learning abilities of several simple iterative algorithms for learning linear threshold functions, obtaining both positive and negative results. We show that Littlestone’s Winnow algorithm is not an efficient PAC learning algorithm for the class of positive linear threshold functions. We also prove that the Perceptron algorithm cannot efficiently learn the unrestricted class of linear threshold functions even under the uniform distribution on boolean examples. However, we show that the Perceptron algorithm can efficiently PAC learn the class of nested functions (a concept class known to be hard for Perceptron under arbitrary distributions) under the uniform distribution on boolean examples. Finally, we give a very simple Perceptron-like algorithm for learning origin-centered halfspaces under the uniform distribution on the unit sphere in R^n. Unlike the Perceptron algorithm, which cannot learn in the presence of classification noise, the new algorithm can learn in the presence of monotonic noise (a generalization of classification noise). The new algorithm is significantly faster than previous algorithms in both the classification and monotonic noise settings.
Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms
- Artificial Intelligence
, 1998
Cited by 9 (1 self)
The absolute loss is the absolute difference between the desired and predicted outcome. This paper demonstrates worst-case upper bounds on the absolute loss for the Perceptron learning algorithm and the Exponentiated Update learning algorithm, which is related to the Weighted Majority algorithm. The bounds characterize the behavior of the algorithms over any sequence of trials, where each trial consists of an example and a desired outcome interval (any value in the interval is an acceptable outcome). The worst-case absolute loss of both algorithms is bounded by: the absolute loss of the best linear function in a comparison class, plus a constant dependent on the initial weight vector, plus a per-trial loss. The per-trial loss can be eliminated if the learning algorithm is allowed a tolerance from the desired outcome. For concept learning, the worst-case bounds lead to mistake bounds that are comparable to past results. This paper is a revised and extended version of Bylander [7].
Agnostic Learning of Monomials by Halfspaces is Hard
Cited by 9 (6 self)
We prove the following strong hardness result for learning: Given a distribution on labeled examples from the hypercube such that there exists a monomial (or conjunction) consistent with a (1 − ϵ)-fraction of the examples, it is NP-hard to find a halfspace that is correct on a (1/2 + ϵ)-fraction of the examples, for arbitrary constant ϵ > 0. In learning theory terms, weak agnostic learning of monomials by halfspaces is NP-hard. This hardness result bridges between and subsumes two previous results which showed similar hardness results for the proper learning of monomials and halfspaces. As immediate corollaries of our result, we give the first optimal hardness results for weak agnostic learning of decision lists and majorities. Our techniques are quite different from previous hardness proofs for learning. We use an invariance principle and sparse approximation of halfspaces from recent work on fooling halfspaces to give a new natural list decoding of a halfspace in the context of dictatorship tests/label cover reductions. In addition, unlike previous invariance principle based proofs which are only known to give Unique Games hardness, we give a reduction from a smooth version of Label Cover that is known to be NP-hard.
Learning Noisy Linear Threshold Functions
, 1998
Cited by 7 (0 self)
This paper describes and analyzes algorithms for learning linear threshold functions (LTFs) in the presence of classification noise and monotonic noise. When there is classification noise, each randomly drawn example is mislabeled (i.e., differs from the target LTF) with the same probability. For monotonic noise, the probability of mislabeling an example monotonically decreases with the separation between the target LTF hyperplane and the example. Monotonic noise is a generalization of classification noise as well as the cases of independent binary features (aka naive Bayes) and normal distributions with equal covariance matrices. Monotonic noise provides a more realistic model of noise because it allows confidence to increase as a function of the distance from the threshold, but it does not impose any artificial form on the function. This paper shows that LTFs are polynomially PAC-learnable in the presence of classification noise and monotonic noise if the separation between examples ...
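The two noise models in this abstract can be contrasted with a small simulation sketch. The exponential flip probability used for monotonic noise is a hypothetical choice made only for illustration; the abstract stresses that the model does not fix any particular form, only that the probability decreases with separation. All function names, the target LTF, and the parameters are assumptions.

```python
import math
import random

def ltf_label(w, theta, x):
    """Label of x under the LTF with weights w and threshold theta (+1 or -1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1

def separation(w, theta, x):
    """Euclidean distance from x to the hyperplane w . x = theta."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - theta) / norm

def monotonic_flip_probability(distance, scale=1.0):
    # Hypothetical form: any function that decreases with distance would qualify.
    return 0.5 * math.exp(-distance / scale)

def noisy_label(w, theta, x, flip_probability, rng):
    """Return the target label, flipped with the given probability."""
    y = ltf_label(w, theta, x)
    return -y if rng.random() < flip_probability else y

rng = random.Random(1)
w, theta = (1.0, -1.0), 0.0
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(10000)]

# Classification noise: the same flip probability (0.2) for every example.
flips = sum(noisy_label(w, theta, x, 0.2, rng) != ltf_label(w, theta, x) for x in points)
flip_rate = flips / len(points)

# Monotonic noise: the flip probability shrinks with distance from the hyperplane.
mono_labels = [
    noisy_label(w, theta, x, monotonic_flip_probability(separation(w, theta, x)), rng)
    for x in points
]
```

Under classification noise the observed flip rate is roughly the same everywhere, while under monotonic noise mislabeled examples concentrate near the decision boundary.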

