Results 1 - 10 of 42
The Weighted Majority Algorithm, 1994
Cited by 556 (37 self)
We study the construction of prediction algorithms in a situation in which a learner faces a sequence of trials, with a prediction to be made in each, and the goal of the learner is to make few mistakes. We are interested in the case that the learner has reason to believe that one of some pool of known algorithms will perform well, but the learner does not know which one. A simple and effective method, based on weighted voting, is introduced for constructing a compound algorithm in such a circumstance. We call this method the Weighted Majority Algorithm. We show that this algorithm is robust in the presence of errors in the data. We discuss various versions of the Weighted Majority Algorithm and prove mistake bounds for them that are closely related to the mistake bounds of the best algorithms of the pool. For example, given a sequence of trials, if there is an algorithm in the pool A that makes at most m mistakes then the Weighted Majority Algorithm will make at most c(log |A| + m) mi...
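The weighted-voting idea in the abstract above can be sketched as follows. This is a minimal illustration for binary predictions, not the paper's own pseudo-code: the function name `weighted_majority`, the penalty factor `beta = 0.5`, and the rule that ties go to label 1 are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def weighted_majority(
    expert_preds: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    beta: float = 0.5,  # assumed penalty factor, must lie in [0, 1)
) -> Tuple[int, List[float]]:
    """Run a basic weighted-majority vote over binary (0/1) predictions.

    expert_preds[t][i] is the prediction of pool member i on trial t.
    Returns the number of mistakes made by the combined predictor and
    the final weights.
    """
    n_experts = len(expert_preds[0])
    weights = [1.0] * n_experts  # every pool member starts with weight 1
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0  # ties go to 1 (assumption)
        if guess != y:
            mistakes += 1
        # Shrink the weight of every pool member that was wrong on this trial.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes, weights


# Hypothetical pool: member 0 is always right, members 1 and 2 always wrong.
outcomes = [1, 0, 1, 1]
expert_preds = [[y, 1 - y, 1 - y] for y in outcomes]
mistakes, final_weights = weighted_majority(expert_preds, outcomes)
```

In this toy run the combined predictor errs only while the two unreliable members still outweigh the reliable one; after that the vote follows the member that has made no mistakes.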
A Practical Bayesian Framework for Backprop Networks - Neural Computation, 1991
Cited by 347 (19 self)
A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible: (1) objective comparisons between solutions using alternative network architectures
Error Correlation And Error Reduction In Ensemble Classifiers, 1996
Cited by 139 (21 self)
Using an ensemble of classifiers, instead of a single classifier, can lead to improved generalization. The gains obtained by combining, however, are often affected more by the selection of what is presented to the combiner than by the actual combining method that is chosen. In this paper we focus on data selection and classifier training methods, in order to "prepare" classifiers for combining. We review a combining framework for classification problems that quantifies the need for reducing the correlation among individual classifiers. Then, we discuss several methods that make the classifiers in an ensemble more complementary. Experimental results are provided to illustrate the benefits and pitfalls of reducing the correlation among classifiers, especially when the training data is in limited supply.
1 Introduction
A classifier's ability to meaningfully respond to novel patterns, or generalize, is perhaps its most important property (Levin et al., 1990; Wolpert, 1990). In...
Linear and Order Statistics Combiners for Pattern Classification - Combining Artificial Neural Nets, 1999
Cited by 56 (6 self)
Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This chapter provides an analytical framework to quantify the improvements in classification results due to combining. The results apply to both linear combiners and order statistics combiners. We first show that to a first order approximation, the error rate obtained over and above the Bayes error rate, is directly proportional to the variance of the actual decision boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces this variance, and hence reduces the "added" error. If N unbiased classifiers are combined by simple averaging, the added error rate can be reduced by a factor of N if the individual errors in approximating the decision boundaries are uncorrelated. Expressions are then derived for linear combiners which are biased or correlated, and the effect of output correlations on ensemble performance is quantified. For order statistics based non-linear combiners, we derive expressions that indicate how much the median, the maximum and in general the ith order statistic can improve classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions, and combining in output space. Experimental results on several public domain data sets are provided to illustrate the benefits of combining and to support the analytical results.
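The factor-of-N claim for simple averaging can be checked numerically with a small simulation. The sketch below is only an illustration under stated assumptions: each classifier's boundary offset is modelled as an independent zero-mean Gaussian (the abstract requires only unbiased, uncorrelated errors), and the function name `variance_reduction` and its parameters are hypothetical.

```python
import random
import statistics


def variance_reduction(
    n_classifiers: int,
    n_trials: int = 20000,
    sigma: float = 1.0,
    seed: int = 0,
) -> float:
    """Estimate var(single offset) / var(averaged offset) by simulation.

    Offsets are drawn as independent zero-mean Gaussians, which is an
    assumption of this sketch, not something stated in the abstract.
    """
    rng = random.Random(seed)
    single = []
    averaged = []
    for _ in range(n_trials):
        offsets = [rng.gauss(0.0, sigma) for _ in range(n_classifiers)]
        single.append(offsets[0])                      # one classifier alone
        averaged.append(sum(offsets) / n_classifiers)  # simple averaging combiner
    return statistics.pvariance(single) / statistics.pvariance(averaged)


ratio = variance_reduction(n_classifiers=10)
```

With uncorrelated offsets the ratio should come out close to the number of classifiers. Introducing correlation between the offsets would shrink it, which is the effect the abstract says the later expressions quantify.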
On the Relationship Between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions - Neural Computation, 1996
Cited by 42 (6 self)
Feedforward networks are a class of regression techniques that can be used to learn to perform some task from a set of examples. The question of generalization of network performance from a finite training set to unseen data is clearly of crucial importance. In this article we first show that the generalization error can be decomposed in two terms: the approximation error, due to the insufficient representational capacity of a finite sized network, and the estimation error, due to insufficient information about the target function because of the finite number of samples. We then consider the problem of approximating functions belonging to certain Sobolev spaces with Gaussian Radial Basis Functions. Using the above mentioned decomposition we bound the generalization error in terms of the number of basis functions and number of examples. While the bound that we derive is specific for Radial Basis Functions, a number of observations deriving from it apply to any approximation t...
Learning in Linear Neural Networks: a Survey - IEEE Transactions on Neural Networks, 1995
Cited by 42 (4 self)
Networks of linear units are the simplest kind of networks, where the basic questions related to learning, generalization, and self-organisation can sometimes be answered analytically. We survey most of the known results on linear networks, including: (1) back-propagation learning and the structure of the error function landscape; (2) the temporal evolution of generalization; (3) unsupervised learning algorithms and their properties. The connections to classical statistical ideas, such as principal component analysis (PCA), are emphasized as well as several simple but challenging open questions. A few new results are also spread across the paper, including an analysis of the effect of noise on back-propagation networks and a unified view of all unsupervised algorithms.
Keywords: linear networks, supervised and unsupervised learning, Hebbian learning, principal components, generalization, local minima, self-organisation
I. Introduction
This paper addresses the problems of supervise...
Algebraic analysis for non-identifiable learning machines - Neural Computation
Cited by 35 (13 self)
This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a non-identifiable learning machine such as a multilayer neural network whose true parameter set is an analytic set with singular points. By using a concept in algebraic analysis, we rigorously prove that the Bayesian stochastic complexity or the free energy is asymptotically equal to λ1 log n − (m1 − 1) log log n + constant, where n is the number of training samples and λ1 and m1 are a rational number and a natural number, respectively, determined as birational invariants of the singularities in the parameter space. We also show an algorithm to calculate λ1 and m1 based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ1 is equal to the number of parameters and m1 = 1, whereas in nonregular models such as multilayer networks, 2λ1 is not larger than the number of parameters and m1 ≥ 1. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, non-identifiable learning machines are better models than regular ones if Bayesian ensemble learning is applied.
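To make the asymptotic form concrete, the leading terms can be evaluated for a regular and a nonregular case. This is a rough sketch only: the additive constant is dropped, and the parameter count `d = 10` and the nonregular values `lam = 2.0`, `m = 2` are hypothetical numbers chosen to satisfy 2λ1 ≤ d and m1 ≥ 1, not values taken from the paper.

```python
import math


def stochastic_complexity(n: int, lam: float, m: int = 1) -> float:
    """Leading terms lam*log(n) - (m - 1)*log(log(n)); the constant is omitted."""
    return lam * math.log(n) - (m - 1) * math.log(math.log(n))


n = 1000  # number of training samples (illustrative)
d = 10    # hypothetical number of parameters

# Regular model: 2*lam equals the number of parameters and m = 1.
regular = stochastic_complexity(n, lam=d / 2, m=1)

# Nonregular model: hypothetical invariants with 2*lam <= d and m >= 1.
singular = stochastic_complexity(n, lam=2.0, m=2)
```

Under these assumed values the nonregular model has the smaller leading-order stochastic complexity, in line with the abstract's closing remark.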
Flat Minima, 1997
Cited by 32 (13 self)
this paper (available on the World-Wide Web; see our home pages) contains pseudo-code of an efficient implementation. It is based on fast multiplication of the Hessian and a vector due to Pearlmutter (1994) and Møller (1993).
On-Line Learning Processes in Artificial Neural Networks, 1993
Cited by 26 (4 self)
We study on-line learning processes in artificial neural networks from a general point of view. On-line learning means that a learning step takes place at each presentation of a randomly drawn training pattern. It can be viewed as a stochastic process governed by a continuous-time master equation. On-line learning is necessary if not all training patterns are available all the time. This occurs in many applications when the training patterns are drawn from a time-dependent environmental distribution. Studying learning in a changing environment, we encounter a conflict between the adaptability and the confidence of the network's representation. Minimization of a criterion incorporating both effects yields an algorithm for on-line adaptation of the learning parameter. The inherent noise of on-line learning makes it possible to escape from undesired local minima of the error potential on which the learning rule performs (stochastic) gradient descent. We try to quantify these often made cl...
Theoretical Foundations Of Linear And Order Statistics Combiners For Neural Pattern Classifiers - IEEE Transactions on Neural Networks, 1996
Cited by 25 (5 self)
Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This paper provides an analytical framework to quantify the improvements in classification results due to combining. The results apply to both linear combiners and the order statistics combiners introduced in this paper. We show that combining networks in output space reduces the variance of the actual decision region boundaries around the optimum boundary. For linear combiners, we show that in the absence of classifier bias, the added classification error is proportional to the boundary variance. For non-linear combiners, we show analytically that the selection of the median, the maximum and in general the ith order statistic improves classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions...

