Results 1-10 of 23
How to Use Expert Advice
 Journal of the Association for Computing Machinery
, 1997
Cited by 314 (65 self)
Abstract:
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
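The combining scheme described above can be sketched with the deterministic weighted-majority variant: each expert starts with equal weight, the combined predictor follows the weighted vote, and experts that err are penalized multiplicatively. This is a minimal sketch of the family of algorithms the abstract refers to, not the paper's exact algorithm, which randomizes its predictions and tunes the update rate to achieve the square-root regret bound.

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Deterministic weighted-majority prediction with expert advice.

    expert_preds: per round, a list with one 0/1 prediction per expert.
    outcomes: the observed bit for each round.
    Returns the number of mistakes the combined predictor makes.
    """
    n_experts = len(expert_preds[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        # Follow the weighted vote of the experts.
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        guess = 1 if vote_one >= sum(weights) / 2 else 0
        mistakes += int(guess != y)
        # Multiplicatively penalize every expert that erred this round.
        weights = [w * beta if p != y else w
                   for w, p in zip(weights, preds)]
    return mistakes
```

With one perfect expert present, the combined predictor's mistakes stay within a constant factor of zero plus a logarithmic term in the number of experts, which is the deterministic precursor of the bound quoted in the abstract.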
Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
 Machine Learning
, 1994
Cited by 109 (12 self)
Abstract:
In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models.
1 Introduction
Consider a simple concept learning model in which the learner attempts to infer an unknown target concept f, chosen from a known concept class F of {0,1}-valued functions over an instance space X....
Learning in Linear Neural Networks: a Survey
 IEEE Transactions on neural networks
, 1995
Cited by 56 (4 self)
Abstract:
Networks of linear units are the simplest kind of networks, where the basic questions related to learning, generalization, and self-organisation can sometimes be answered analytically. We survey most of the known results on linear networks, including: (1) backpropagation learning and the structure of the error function landscape; (2) the temporal evolution of generalization; (3) unsupervised learning algorithms and their properties. The connections to classical statistical ideas, such as principal component analysis (PCA), are emphasized, as well as several simple but challenging open questions. A few new results are also spread across the paper, including an analysis of the effect of noise on backpropagation networks and a unified view of all unsupervised algorithms.
Keywords: linear networks, supervised and unsupervised learning, Hebbian learning, principal components, generalization, local minima, self-organisation
I. Introduction
This paper addresses the problems of supervise...
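The connection between Hebbian learning in a single linear unit and PCA mentioned in the abstract can be made concrete with Oja's rule, one of the classical unsupervised algorithms in this family. A minimal sketch, assuming zero-mean data; the learning rate and epoch count here are illustrative choices:

```python
import numpy as np

def oja_first_component(X, lr=0.01, epochs=100, seed=0):
    """Extract the first principal component of zero-mean data X (n x d)
    with Oja's Hebbian rule for a single linear unit:
        w <- w + lr * y * (x - y * w),  where y = w . x
    The -y*w decay term keeps the weight norm bounded, so w converges
    to the dominant eigenvector of the data covariance.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                    # unit's linear output
            w += lr * y * (x - y * w)    # Hebbian growth with decay
    return w / np.linalg.norm(w)
```

On data elongated along one axis, the learned weight vector aligns with that axis, i.e. with the first principal component.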
Rigorous learning curve bounds from statistical mechanics
 Machine Learning
, 1994
Cited by 53 (9 self)
Abstract:
In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the well-established Vapnik-Chervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior (functional form) of learning curves. This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory. The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes. We illustrate our results with many concrete examples of learning curve bounds derived from our theory.
1 Introduction
According to the Vapnik-Chervonenkis (VC) theory of learning curves [27, 26], minimizing empirical error within a function class F on a random sample of m examples leads to generalization error bounded by Õ(d/m) (in the case that the target function is contained in F) or Õ(√(d/m)) plus the optimal generalization error achievable within F (in the general case). These bounds are universal: they hold for any class of hypothesis functions F, for any input distribution, and for any target function. The only problem-specific quantity remaining in these bounds is the VC dimension d, a measure of the complexity of the function class F. It has been shown that these bounds are essentially the best distribution-independent bounds possible, in the sense that for any function class, there exists an input distribution for which matching lower bounds on the generalization error can be given [5, 7, 22].
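Written out, the two VC learning curve bounds quoted in the introduction take the following form (a sketch in the abstract's notation, with ε(m) the generalization error after m examples, ε_opt the best error achievable within F, and logarithmic factors absorbed into Õ):

```latex
\epsilon(m) = \tilde{O}\!\left(\frac{d}{m}\right)
\quad \text{(realizable case, } f \in F\text{)},
\qquad
\epsilon(m) = \epsilon_{\mathrm{opt}} + \tilde{O}\!\left(\sqrt{\frac{d}{m}}\right)
\quad \text{(general case)}.
```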
On Weak Learning
 Journal of Computer and System Sciences
, 1995
Cited by 50 (9 self)
Abstract:
This paper presents relationships between weak learning, weak prediction (where the probability of being correct is slightly larger than 50%), and consistency oracles (which decide whether or not a given set of examples is consistent with a concept in the class). Our main result is a simple polynomial prediction algorithm which makes only a single query to a consistency oracle and whose predictions have a polynomial edge over random guessing. We compare this prediction algorithm with several of the standard prediction techniques, deriving an improved worst-case bound on the Gibbs algorithm in the process. We use our algorithm to show that a concept class is polynomially learnable if and only if there is a polynomial probabilistic consistency oracle for the class. Since strong learning algorithms can be built from weak learning algorithms, our results also characterize strong learnability.
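The Gibbs algorithm mentioned above admits a very short sketch for an explicit finite concept class: draw a hypothesis uniformly from the version space (the concepts consistent with the labeled examples) and predict with it. This toy enumeration is for illustration only; the paper works with consistency oracles rather than an enumerable class.

```python
import random

def gibbs_predict(concept_class, examples, x, seed=0):
    """Gibbs algorithm over an explicit finite concept class.

    concept_class: list of callables mapping an instance to 0/1.
    examples: list of (instance, label) pairs.
    Draws a hypothesis uniformly from the version space and returns
    its prediction on x.
    """
    version_space = [c for c in concept_class
                     if all(c(xi) == yi for xi, yi in examples)]
    rng = random.Random(seed)
    return rng.choice(version_space)(x)
```

For example, with the class of integer threshold functions, two labeled points narrow the version space to a few thresholds that may all agree on a query point, in which case the random draw is irrelevant.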
Calculation of the Learning Curve of Bayes Optimal Classification Algorithm for Learning a Perceptron With Noise
 In Computational Learning Theory: Proceedings of the Fourth Annual Workshop
, 1991
Cited by 17 (6 self)
Abstract:
The learning curve of the Bayes optimal classification algorithm when learning a perceptron from noisy random training examples is calculated exactly in the limit of large training sample size and large instance space dimension using methods of statistical mechanics. It is shown that under certain assumptions, in this "thermodynamic" limit, the probability of misclassification of the Bayes optimal algorithm is less than that of a canonical stochastic learning algorithm, by a factor approaching √2 as the ratio of the number of training examples to the instance space dimension grows. Exact asymptotic learning curves for both algorithms are derived for particular distributions. In addition, it is shown that the learning performance of the Bayes optimal algorithm can be approximated by certain learning algorithms that use a neural net with a layer of hidden units to learn a perceptron.
1 Introduction
Extending a line of research initiated by Elizabeth Gardner [Gar88, GD88], exceptional progress has been ...
AverageCase Learning Curves for Radial Basis Function Networks
 Neural Computation
, 1995
Cited by 5 (0 self)
Abstract:
The application of statistical physics to the study of the learning curves of feedforward connectionist networks has, to date, been concerned mostly with networks that do not include hidden layers. Recent work has extended the theory to networks such as committee machines and parity machines; however, these are not networks that are often used in practice, and an important direction for current and future research is the extension of the theory to practical connectionist networks. In this paper we investigate the learning curves of a class of networks that has been widely and successfully applied to practical problems: the Gaussian radial basis function networks (RBFNs). We address the problem of learning linear and nonlinear, realizable and unrealizable target rules from noise-free training examples using a stochastic training algorithm. Expressions for the generalization error, defined as the expected error for a network with a given set of parameters, are derived for general Gaussia...
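The network class studied here, the Gaussian RBFN, computes a weighted sum of Gaussian bumps centered at fixed points in input space. A minimal sketch of the forward pass; the parameter names are illustrative, not the paper's notation:

```python
import numpy as np

def rbfn_output(x, centers, widths, weights, bias=0.0):
    """Output of a Gaussian radial basis function network:
        f(x) = bias + sum_k weights[k] * exp(-||x - c_k||^2 / (2 * s_k^2))
    centers: list of center vectors c_k; widths: scalars s_k;
    weights: output-layer coefficients.
    """
    x = np.asarray(x, dtype=float)
    activations = np.array([
        np.exp(-np.sum((x - c) ** 2) / (2.0 * s ** 2))
        for c, s in zip(centers, widths)
    ])
    return bias + np.dot(weights, activations)
```

A unit whose center coincides with the query point fires at full strength exp(0) = 1, while far from every center the output decays to the bias; the generalization error analyzed in the paper is the expected squared deviation of such outputs from the target rule.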
Part 1: Overview of the Probably Approximately Correct (PAC) Learning Framework
, 1995
Cited by 4 (0 self)
Abstract:
Here we survey some recent theoretical results on the efficiency of machine learning algorithms. The main tool described is the notion of Probably Approximately Correct (PAC) learning, introduced by Valiant. We define this learning model and then look at some of the results obtained in it. We then consider some criticisms of the PAC model and the extensions proposed to address these criticisms. Finally, we look briefly at other models recently proposed in computational learning theory.
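The PAC framework surveyed here can be made concrete with the standard textbook sample-size bound for a finite hypothesis class in the realizable case; this is a generic illustration of the model, not a result specific to the survey:

```python
import math

def pac_sample_size(num_hypotheses, eps, delta):
    """Sample size sufficient to PAC-learn a finite hypothesis class H
    in the realizable case:
        m >= (ln|H| + ln(1/delta)) / eps
    guarantees that any hypothesis consistent with m random examples
    has error at most eps, with probability at least 1 - delta.
    """
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / eps)
```

The bound grows only logarithmically in the class size and in 1/delta, but linearly in 1/eps, which is the characteristic shape of PAC sample-complexity results.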
Statistical Mechanics of Learning From Examples  I. General Formulation and Annealed Approximation
Cited by 2 (0 self)
Abstract:
In this paper we have studied the process of learning from examples with a stochastic training dynamics. The level of noise in the dynamics is denoted by the temperature T. One of the most important results of our analysis is that learning at finite temperature is possible, and sometimes advantageous. For any finite T, as the number of examples increases, the network weights approach their optimal values, namely the values that minimize the generalization error. In Part I we have focused mainly on realizable rules. When trained with a fixed number of examples, our realizable models have average generalization errors that increase with T. Thus from purely equilibrium considerations working at T = 0 is better. On the other hand, the lower the temperature the longer it may take to reach equilibrium. This is particularly true for highly nonlinear models, such as the Boolean perceptron with discrete weights. Although the critical number of examples per weight α_c(T) increases with T
Noisy Reinforcement Training for pRAM Nets
 Neural Networks
, 1994
Cited by 2 (1 self)
Abstract:
The use of additional noise in reinforcement training of probabilistic RAMs (pRAMs) is analysed in the context of pattern recognition. Both simulations and analysis indicate the effectiveness of the approach.