## Algorithmic Theories of Learning (1999)

Venue: Foundations of Computer Science

Citations: 4 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Arriaga99algorithmictheories,
  author    = {Rosa Arriaga and Santosh Vempala},
  title     = {Algorithmic Theories of Learning},
  booktitle = {Foundations of Computer Science},
  year      = {1999}
}
```

### Abstract

We study the phenomenon of cognitive learning from an algorithmic standpoint. How does the brain effectively learn concepts from a small number of examples, in spite of the fact that each example contains a huge amount of information? We provide a novel analysis for a model of robust concept learning (closely related to "margin classifiers"), and show that a relatively small number of examples are sufficient to learn rich concept classes (including threshold functions, boolean formulae and polynomial surfaces). As a result, we obtain simple intuitive proofs for the generalization bounds of Support Vector Machines. In addition, the new algorithms have several advantages --- they are faster, conceptually simpler, and highly resistant to noise. For example, a robust half-space can be PAC-learned in linear time using only a constant number of training examples, regardless of the number of attributes. A general (algorithmic) consequence of the model, that "more robust concepts are...
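
The two-stage scheme the abstract describes — compress each example with a random projection, then run a cheap learner in the small space — can be sketched as follows. The data generator, the dimensions, and the choice of perceptron updates as the second stage are all illustrative assumptions for the demo, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a robust (large-margin) half-space in high dimension.
n, k, m = 1000, 50, 200            # ambient dim, projected dim, sample size
w_star = rng.normal(size=n)
w_star /= np.linalg.norm(w_star)

y = rng.choice([-1.0, 1.0], size=m)
noise = rng.normal(size=(m, n)) / np.sqrt(n)   # rows of roughly unit norm
X = y[:, None] * w_star + 0.1 * noise          # margin ~ 1 relative to |x| ~ 1

# Stage 1: random projection to R^k by a matrix with i.i.d. N(0,1) entries,
# scaled by 1/sqrt(k) so squared norms are preserved in expectation.
R = rng.normal(size=(k, n)) / np.sqrt(k)
Xp = X @ R.T

# Stage 2: a simple learner in the small space; here, perceptron updates.
w = np.zeros(k)
for _ in range(50):
    for xi, yi in zip(Xp, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi

train_err = float(np.mean(np.sign(Xp @ w) != y))
```

Because the margin is large relative to the norms, the projection to k = 50 dimensions keeps the two classes well separated, and the perceptron fits the projected sample regardless of the original n = 1000 attributes.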

### Citations

2175 | Support-vector networks
- Cortes, Vapnik
- 1995

Citation Context: ...endently. Then we address the question of how many examples are needed to efficiently learn a concept with robustness ℓ. The bounds we obtain here are very similar to those of Support Vector Machines [5, 7], although our algorithms and proofs are entirely different. We consider several rich concept classes, including half-spaces, intersections of half-spaces, DNF formulae, and polynomial surfaces. Using ...

1879 | An Introduction to Probability Theory and its Applications
- Feller
- 1995

Citation Context: ...osen from N(0, 1), then the random variables R_t^T u are also normally distributed (and identical), and the sum of their squares is a random variable which has the chi-squared distribution with k degrees of freedom [8]. For simplicity, let Y = (k/|u|²)|u′|². Then Y can be seen as the sum of the squares of k random variables, each of which is distributed according to N(0, 1). As such it has the moment generating ...
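
For reference, the moment generating function this excerpt invokes is the standard chi-squared one: with Y the sum of squares of k independent standard normals,

```latex
Y = \sum_{i=1}^{k} Z_i^2, \quad Z_i \sim N(0,1) \text{ i.i.d.}
\qquad\Longrightarrow\qquad
\mathbb{E}\!\left[e^{tY}\right] = (1-2t)^{-k/2}, \quad t < \tfrac{1}{2}.
```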

1696 | A theory of the learnable
- Valiant
- 1984
Citation Context: ... and a confidence parameter δ, with probability at least 1 − δ, the algorithm has to find a concept that has error at most ε on D. Then the algorithm is said to PAC-learn the concept class [25]. The basic insight is the idea of robustness. Intuitively, a concept is "robust" if it is immune to attribute noise. That is, modifying the attributes of an example by some bounded amount does not ch...

1494 | Probability inequalities for sums of bounded random variables - Hoeffding - 1963

1140 | Geometric Algorithms and Combinatorial Optimization - Grötschel, Lovász, et al. - 1988

946 | On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab. Appl.
- Vapnik, Chervonenkis
- 1971

Citation Context: ...llows us to bound the number of examples required to learn the concepts as a function of ℓ, independent of the original number of attributes, via an adaptation of the fundamental VC-dimension theorem [28]. To give a brief preview, for half-spaces in n-dimensional space with robustness ℓ, O(1/ℓ²) examples suffice, while for k-term DNF formulae on n variables, with the same robustness, O(k/ℓ²) e...

712 | Approximate nearest neighbors: towards removing the curse of dimensionality
- Indyk, Motwani
- 1998

Citation Context: ... whose entries are all chosen from either N(0, 1) or U(−1, 1), independently. The following theorem summarizes the results of this section. A version of this, for the case of N(0, 1), appeared in [12] (although the motivation there was to give a simple proof of the Johnson-Lindenstrauss lemma). Theorem 1 (Neuronal RP) Let u, v ∈ R^n. Let u′ and v′ be the projections of u and v to R^k via a rand...
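
The distance-preservation property behind Theorem 1 is easy to check empirically. A minimal sketch follows; the dimensions and the 1/√k scaling convention are my choices for the demo, not the paper's exact statement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Project u, v from R^n down to R^k with a matrix of i.i.d. N(0, 1)
# entries, scaled by 1/sqrt(k), and compare the pairwise distance
# before and after projection.
n, k = 10_000, 400
u = rng.normal(size=n)
v = rng.normal(size=n)

R = rng.normal(size=(k, n)) / np.sqrt(k)
u_p, v_p = R @ u, R @ v

ratio = np.linalg.norm(u_p - v_p) / np.linalg.norm(u - v)
```

Since k·ratio² has the chi-squared distribution with k degrees of freedom, the ratio concentrates around 1 with fluctuations of order 1/√k.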

674 | Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm
- Littlestone
- 1988
Citation Context: ... relatively small number of examples, when each example consists of a huge amount of information? An existing computational model that directly addresses this question is attribute-efficient learning [26, 17, 18]. In this model, it is assumed that the target concept is simple in a specific manner: it is a function of only a small subset of the set of attributes, called the relevant attributes, while the rest ...

674 | Principles of categorization
- Rosch
- 1978

Citation Context: ...gory, subjects consistently list members that are closer to the prototype both earlier and more often (e.g. for the category Bird, the examples Sparrow and Robin are produced more often than Ostrich) [22]. Further, when asked to classify instances, it is found that examples closer to the prototype are classified more quickly. Similar results were found in studies with artificially generated categories ...

516 | Perceptrons: an Introduction to Computational Geometry
- Minsky, Papert
- 1968

Citation Context: ... on a sample of O(n) examples. Typically, however, it is solved by using simple greedy methods. A commonly-used greedy algorithm is the Perceptron Algorithm [24, 1], which has the following guarantee [19]. Given a collection of data points in R^n, each labeled as positive or negative, the algorithm will find a vector w such that w · x > 0 for all positive points x and w · x < 0 for all nega...
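
The quoted guarantee can be exercised directly. Below is a minimal perceptron sketch on synthetic separable data; the margin filter and dimensions are illustrative, and the classic mistake bound (R/γ)² is checked at the end.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic linearly separable data with an enforced margin gamma.
d, gamma = 5, 0.5
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

X = rng.normal(size=(200, d))
X = X[np.abs(X @ w_star) > gamma]     # keep only points with margin > gamma
y = np.sign(X @ w_star)

# Perceptron: update on any misclassified point until none remain.
w = np.zeros(d)
updates = 0
done = False
while not done:
    done = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi
            updates += 1
            done = False

# Classic guarantee: at most (R_max / gamma)^2 updates, where R_max
# bounds the norms of the examples.
R_max = np.linalg.norm(X, axis=1).max()
```

On exit, w satisfies w · x > 0 on every positive point and w · x < 0 on every negative point, exactly the guarantee stated above.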

451 | The geometry of graphs and some of its algorithmic applications
- Linial, London, et al.
- 1995
Citation Context: ...ed that random projection (approximately) preserves key properties of a set of points, e.g., the distances between pairs of points [13]; this has led to efficient algorithms in several other contexts [14, 16, 29, 30]. In section 3, we develop "neuronal" versions of random projection, i.e., we demonstrate that it is easy to implement it using a single layer of perceptrons where the weights of the network are chose...

287 | Principles of Neurodynamics
- Rosenblatt
- 1962

Citation Context: ...a polytime algorithm for linear programming on a sample of O(n) examples. Typically, however, it is solved by using simple greedy methods. A commonly-used greedy algorithm is the Perceptron Algorithm [24, 1], which has the following guarantee [19]. Given a collection of data points in R^n, each labeled as positive or negative, the algorithm will find a vector w such that w · x > 0 for all positive ...

170 | Two algorithms for nearest-neighbor search in high dimensions
- Kleinberg
- 1997

Citation Context: ...ed that random projection (approximately) preserves key properties of a set of points, e.g., the distances between pairs of points [13]; this has led to efficient algorithms in several other contexts [14, 16, 29, 30]. In section 3, we develop "neuronal" versions of random projection, i.e., we demonstrate that it is easy to implement it using a single layer of perceptrons where the weights of the network are chose...

118 | Generalization performance of support vector machines and other pattern classifiers
- Bartlett, Shawe-Taylor
Citation Context: ...endently. Then we address the question of how many examples are needed to efficiently learn a concept with robustness ℓ. The bounds we obtain here are very similar to those of Support Vector Machines [5, 7], although our algorithms and proofs are entirely different. We consider several rich concept classes, including half-spaces, intersections of half-spaces, DNF formulae, and polynomial surfaces. Using ...

110 | Extensions of Lipschitz mappings into a Hilbert space
- Johnson, Lindenstrauss
- 1984

Citation Context: ...w-dimensional space, is suitable for this purpose. It has been observed that random projection (approximately) preserves key properties of a set of points, e.g., the distances between pairs of points [13]; this has led to efficient algorithms in several other contexts [14, 16, 29, 30]. In section 3, we develop "neuronal" versions of random projection, i.e., we demonstrate that it is easy to implement ...

107 | Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow
- Littlestone
- 1991

Citation Context: ... relatively small number of examples, when each example consists of a huge amount of information? An existing computational model that directly addresses this question is attribute-efficient learning [26, 17, 18]. In this model, it is assumed that the target concept is simple in a specific manner: it is a function of only a small subset of the set of attributes, called the relevant attributes, while the rest ...

70 | The relaxation method for linear inequalities
- Agmon
- 1964

Citation Context: ...a polytime algorithm for linear programming on a sample of O(n) examples. Typically, however, it is solved by using simple greedy methods. A commonly-used greedy algorithm is the Perceptron Algorithm [24, 1], which has the following guarantee [19]. Given a collection of data points in R^n, each labeled as positive or negative, the algorithm will find a vector w such that w · x > 0 for all positive ...

70 | Recent views of conceptual structure
- Komatsu
- 1992

Citation Context: ...entiated from one another and they are therefore the first categories we learn and the most important in language." A leading theory of how humans form categories is based on the notion of Prototypes [15, 9]. Prototypes represent the most typical members of a category. The theory says that we abstract a prototype for a category by forming some weighted average of (a subset of) the defining features of ex...

61 | A polynomial-time algorithm for learning noisy linear threshold functions
- Blum, Frieze, et al.
- 1998

Citation Context: ...heorem 4 An ℓ-robust half-space in R^n can be PAC-learned using O(log(1/ℓ)/ℓ²) examples in O(n/ℓ²) time. The Perceptron Algorithm is known to be tolerant to various types of classification noise [4, 2, 6]. It is a straightforward consequence that these properties continue to hold for our algorithm. In the concluding section we discuss straightforward bounds for agnostic learning. 4.2 Intersections of ...

35 | Learning noisy perceptrons by a perceptron in polynomial time
- Cohen
- 1997

Citation Context: ...heorem 4 An ℓ-robust half-space in R^n can be PAC-learned using O(log(1/ℓ)/ℓ²) examples in O(n/ℓ²) time. The Perceptron Algorithm is known to be tolerant to various types of classification noise [4, 2, 6]. It is a straightforward consequence that these properties continue to hold for our algorithm. In the concluding section we discuss straightforward bounds for agnostic learning. 4.2 Intersections of ...

34 | A neuroidal architecture for cognitive computation
- Valiant
- 2000
Citation Context: ... relatively small number of examples, when each example consists of a huge amount of information? An existing computational model that directly addresses this question is attribute-efficient learning [26, 17, 18]. In this model, it is assumed that the target concept is simple in a specific manner: it is a function of only a small subset of the set of attributes, called the relevant attributes, while the rest ...

30 | Learning linear threshold functions in the presence of classification noise
- Bylander
- 1994
Citation Context: ...heorem 4 An ℓ-robust half-space in R^n can be PAC-learned using O(log(1/ℓ)/ℓ²) examples in O(n/ℓ²) time. The Perceptron Algorithm is known to be tolerant to various types of classification noise [4, 2, 6]. It is a straightforward consequence that these properties continue to hold for our algorithm. In the concluding section we discuss straightforward bounds for agnostic learning. 4.2 Intersections of ...

29 | Random Projection: A New Approach to VLSI Layout
- Vempala
- 1998

Citation Context: ...ed that random projection (approximately) preserves key properties of a set of points, e.g., the distances between pairs of points [13]; this has led to efficient algorithms in several other contexts [14, 16, 29, 30]. In section 3, we develop "neuronal" versions of random projection, i.e., we demonstrate that it is easy to implement it using a single layer of perceptrons where the weights of the network are chose...

28 | A random sampling based algorithm for learning the intersection of halfspaces
- Vempala
- 1997

19 | Learning an intersection of k halfspaces over a uniform distribution
- Blum, Kannan
- 1993
Citation Context: ...y distribution or an arbitrary number of half-spaces. However, efficient algorithms have been developed for reasonably general distributions assuming that the number of half-spaces is relatively small [3, 29]. Here we derive efficient learning algorithms for robust concepts in this class. Again, assume that all the half-spaces are homogeneous. Let this class of convex concepts be denoted by H(m, n). A singl...

5 | Private Communication
- Valiant
- 1985

Citation Context: ...ithms and proofs give a simple intuitive way to see the O(1/ε²) sample complexity bounds of margin classifiers (SVMs) [5]. They fit well with attempts to model cognition on a computational basis [26, 27], and agree with the oft-observed phenomenon that finer distinctions take more examples. From a purely computational viewpoint, we obtain new algorithms that address fundamental learning theory prob...

4 | Perceptual vs. conceptual categorization
- Reed, Friedman
- 1973

Citation Context: .... Further, when asked to classify instances, it is found that examples closer to the prototype are classified more quickly. Similar results were found in studies with artificially generated categories [21]. For a prototype P, we could define a family of nested concepts within the category of P according to the distance from P. Then the members of the innermost concept are very similar to P, the memb...

2 | The Structure of Categories
- Glass, Holyoak, et al.
- 1979

Citation Context: ...entiated from one another and they are therefore the first categories we learn and the most important in language." A leading theory of how humans form categories is based on the notion of Prototypes [15, 9]. Prototypes represent the most typical members of a category. The theory says that we abstract a prototype for a category by forming some weighted average of (a subset of) the defining features of ex...

1 | Categorization
- Reed
- 1982

Citation Context: ...y the model from which it was derived, we gathered evidence from psychological studies. Psychological literature reports that natural categories (concept classes) as formed by humans are hierarchical [20], with three clear levels called the Superordinate, Basic level, and Subordinate. For example, for the Superordinate category of Mammals, some Basic level categories are Elephant, Dog, Human, and the S...

1 | Basic objects in natural categories
- Rosch, Mervis, et al.
- 1976

Citation Context: ...important, and are the most clearly demarcated from each other. In our terminology they are the most robust, and we expect them to be easier to learn. This is indeed the case, as noted by Rosch et al. [23]: "...basic level categories are the most differentiated from one another and they are therefore the first categories we learn and the most important in language." A leading theory of how humans form ...