Results 1–10 of 38
A training algorithm for optimal margin classifiers
 PROCEEDINGS OF THE 5TH ANNUAL ACM WORKSHOP ON COMPUTATIONAL LEARNING THEORY
, 1992
Abstract

Cited by 1279 (44 self)
A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.
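As a rough illustration of the margin-maximization idea this abstract describes, the following is a minimal sketch using hinge-loss sub-gradient descent on invented 2-D toy data. The data, learning rate, and regularization constant are all assumptions for the example; this is not the paper's algorithm.

```python
# Hedged sketch: hinge-loss sub-gradient descent on toy 2-D data.
# Points with margin < 1 push the boundary away; otherwise only the
# regularizer shrinks the weights, which widens the relative margin.
def train_margin(points, labels, lr=0.1, lam=0.01, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:  # inside the margin
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:                                     # outside: shrink only
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

points = [(2, 2), (3, 3), (-2, -2), (-3, -1)]   # invented, separable data
labels = [1, 1, -1, -1]
w, b = train_margin(points, labels)
```

On separable data like this, every training point ends up on the correct side of the learned boundary.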
A Practical Bayesian Framework for Backprop Networks
 Neural Computation
, 1991
Abstract

Cited by 398 (20 self)
A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible: (1) objective comparisons between solutions using alternative network architectures
Query by Committee
, 1992
Abstract

Cited by 318 (3 self)
We propose an algorithm called query by committee, in which a committee of students is trained on the same data set. The next query is chosen according to the principle of maximal disagreement. The algorithm is studied for two toy models: the high-low game and perceptron learning of another perceptron. As the number of queries goes to infinity, the committee algorithm yields asymptotically finite information gain. This leads to generalization error that decreases exponentially with the number of examples. This is in marked contrast to learning from randomly chosen inputs, for which the information gain approaches zero and the generalization error decreases with a relatively slow inverse power law. We suggest that asymptotically finite information gain may be an important characteristic of good query algorithms.
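The query-by-committee principle above can be sketched in a few lines: train several perceptrons on the same labelled pool and query the unlabelled point on which they disagree most. The dataset, committee size, and seeds are invented for illustration; this is a toy sketch, not the paper's analysis.

```python
import random

# Hedged sketch of query by committee with a committee of perceptrons.
def perceptron(data, epochs=50, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]   # random start per member
    for _ in range(epochs):
        for (x1, x2), y in data:
            if y * (w[0] * x1 + w[1] * x2) <= 0:   # misclassified: update
                w[0] += y * x1
                w[1] += y * x2
    return w

def disagreement(committee, x):
    votes = [1 if w[0] * x[0] + w[1] * x[1] > 0 else -1 for w in committee]
    return min(votes.count(1), votes.count(-1))    # 0 means unanimous

labelled = [((1, 1), 1), ((-1, -1), -1)]           # invented labelled pool
pool = [(2, 2), (0.1, -0.1), (-2, -2)]             # invented unlabelled pool
committee = [perceptron(labelled, seed=s) for s in range(5)]
query = max(pool, key=lambda x: disagreement(committee, x))
```

The point nearest the version-space boundary draws the most disagreement, so it is selected as the next query.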
Handwritten Digit Recognition with a Back-Propagation Network
 Advances in Neural Information Processing Systems
, 1990
Abstract

Cited by 186 (16 self)
We present an application of back-propagation networks to handwritten digit recognition. Minimal preprocessing of the data was required, but the architecture of the network was highly constrained and specifically designed for the task. The input of the network consists of normalized images of isolated digits. The method has a 1% error rate and about a 9% reject rate on zip-code digits provided by the U.S. Postal Service.

1 Introduction

The main point of this paper is to show that large back-propagation (BP) networks can be applied to real image-recognition problems without a large, complex preprocessing stage requiring detailed engineering. Unlike most previous work on the subject (Denker et al., 1989), the learning network is directly fed with images, rather than feature vectors, thus demonstrating the ability of BP networks to deal with large amounts of low-level information. Previous work performed on simple digit images (Le Cun, 1989) showed that the architecture of the network strongly...
Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
 Machine Learning
, 1994
Abstract

Cited by 108 (12 self)
In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models.

1 Introduction

Consider a simple concept learning model in which the learner attempts to infer an unknown target concept f, chosen from a known concept class F of {0,1}-valued functions over an instance space X....
A Simple Weight Decay Can Improve Generalization
 ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 4
, 1992
Abstract

Cited by 87 (0 self)
It has been observed in numerical simulations that weight decay can improve generalization in a feed-forward neural network. This paper explains why. It is proven that weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if its size is chosen correctly, weight decay can suppress some of the effects of static noise on the targets, which can significantly improve generalization. It is then shown how to extend these results to networks with hidden layers and nonlinear units. Finally, the theory is confirmed by numerical simulations using data from NetTalk.
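The shrinkage effect described in this abstract is easy to see in the simplest case. Below is a minimal sketch, assuming a one-weight linear model with invented noisy data: an L2 penalty enters the closed-form least-squares solution as an additive term in the denominator, pulling the weight toward zero and damping target noise.

```python
# Hedged sketch: 1-D least squares with an L2 weight-decay penalty.
# Minimizes sum (y - w*x)^2 + decay * w^2, which has the closed form below.
def fit(xs, ys, decay=0.0):
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + decay   # decay inflates the denominator
    return num / den

xs = [1.0, 2.0, 3.0]          # invented inputs
ys = [1.1, 1.9, 3.2]          # noisy targets around y = x
w_plain = fit(xs, ys)
w_decay = fit(xs, ys, decay=1.0)
```

The decayed solution is strictly smaller in magnitude than the plain least-squares weight, which is the suppression effect the paper analyzes in full generality.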
Theory and Applications of Agnostic PACLearning with Small Decision Trees
, 1995
Abstract

Cited by 75 (2 self)
We exhibit a theoretically founded algorithm T2 for agnostic PAC-learning of decision trees of at most 2 levels, whose computation time is almost linear in the size of the training set. We evaluate the performance of this learning algorithm T2 on 15 common "real-world" datasets, and show that for most of these datasets T2 provides simple decision trees with little or no loss in predictive power (compared with C4.5). In fact, for datasets with continuous attributes its error rate tends to be lower than that of C4.5. To the best of our knowledge, this is the first time that a PAC-learning algorithm has been shown to be applicable to "real-world" classification problems. Since one can prove that T2 is an agnostic PAC-learning algorithm, T2 is guaranteed to produce close-to-optimal 2-level decision trees from sufficiently large training sets for any (!) distribution of data. In this regard T2 differs strongly from all other learning algorithms that are considered in applied machine learning, for w...
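To make the small-tree setting concrete, here is a minimal sketch of the building block of such trees: an exhaustive search for the best single axis-aligned split (a 1-level tree, or decision stump). The data are invented, and this is a simplified illustration only; T2 proper builds 2-level trees with agnostic PAC guarantees.

```python
# Hedged sketch: exhaustive search over (feature, threshold, leaf-sign)
# for the best decision stump on a tiny invented dataset.
def best_stump(rows, labels):
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for sign in (1, -1):
                pred = [sign if r[f] <= t else -sign for r in rows]
                acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
                if best is None or acc > best[0]:
                    best = (acc, f, t, sign)
    return best   # (training accuracy, feature index, threshold, leaf sign)

rows = [(1.0, 5.0), (2.0, 4.0), (6.0, 1.0), (7.0, 2.0)]   # invented data
labels = [1, 1, -1, -1]
acc, f, t, sign = best_stump(rows, labels)
```

The search is linear in the number of distinct thresholds per feature, which hints at why such shallow trees can be learned in time almost linear in the training set.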
Evolving Optimal Neural Networks Using Genetic Algorithms with Occam's Razor
 COMPLEX SYSTEMS
, 1993
Abstract

Cited by 40 (6 self)
Genetic algorithms have been used for neural networks in two main ways: to optimize the network architecture and to train the weights of a fixed architecture. While most previous work focuses on only one of these two options, this paper investigates an alternative evolutionary approach called Breeder Genetic Programming (BGP) in which the architecture and the weights are optimized simultaneously. The genotype of each network is represented as a tree whose depth and width are dynamically adapted to the particular application by specifically defined genetic operators. The weights are trained by a next-ascent hill-climbing search. A new fitness function is proposed that quantifies the principle of Occam's razor. It makes an optimal trade-off between the error-fitting ability and the parsimony of the network. Simulation results on two benchmark problems of differing complexity suggest that the method finds minimal-size networks on clean data. The experiments on noisy data show...
Combinations of Weak Classifiers
, 1997
Abstract

Cited by 35 (1 self)
To obtain classification systems with both good generalization performance and efficiency in space and time, we propose a learning method based on combinations of weak classifiers, where weak classifiers are linear classifiers (perceptrons) which can do a little better than making random guesses. A randomized algorithm is proposed to find the weak classifiers. They are then combined through a majority vote. As demonstrated through systematic experiments, the method developed is able to obtain combinations of weak classifiers with good generalization performance and a fast training time on a variety of test problems and real applications. Theoretical analysis on one of the test problems investigated in our experiments provides insight into when and why the proposed method works. In particular, when the strength of the weak classifiers is properly chosen, combinations of weak classifiers can achieve good generalization performance with polynomial space and time complexity.
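The recipe in this abstract can be sketched directly: randomly generate linear classifiers, keep those that beat chance on the training set, and combine the survivors by majority vote. The dataset, committee size, and acceptance threshold below are invented for illustration and are not the paper's exact randomized algorithm.

```python
import random

# Hedged sketch: majority vote over randomly found weak linear classifiers.
def random_classifier(rng):
    a, b, c = (rng.uniform(-1, 1) for _ in range(3))
    return lambda x: 1 if a * x[0] + b * x[1] + c > 0 else -1

def accuracy(clf, data):
    return sum(clf(x) == y for x, y in data) / len(data)

def build_committee(data, n=25, seed=0):
    rng = random.Random(seed)
    weak = []
    while len(weak) < n:
        clf = random_classifier(rng)
        if accuracy(clf, data) > 0.5:   # keep only better-than-chance members
            weak.append(clf)
    return lambda x: 1 if sum(c(x) for c in weak) > 0 else -1

data = [((1, 1), 1), ((2, 1), 1), ((-1, -1), -1), ((-2, -1), -1)]  # invented
vote = build_committee(data)
```

Because each retained member is correct on more than half the training points, the majority vote is guaranteed to be strictly better than chance on this set, illustrating the boosting-like effect the paper analyzes.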
On Overfitting Avoidance As Bias
 SFI TR
, 1993
Abstract

Cited by 33 (6 self)
In supervised learning it is commonly believed that penalizing complex functions helps one avoid "overfitting" functions to data, and therefore improves generalization. It is also commonly believed that cross-validation is an effective way to choose among algorithms for fitting functions to data. In a recent paper, Schaffer (1993) presents experimental evidence disputing these claims. The current paper consists of a formal analysis of these contentions of Schaffer's. It proves that his contentions are valid, although some of his experiments must be interpreted with caution.