A Simple Weight Decay Can Improve Generalization
Advances in Neural Information Processing Systems 4, 1992
Cited by 87 (0 self)
It has been observed in numerical simulations that a weight decay can improve generalization in a feedforward neural network. This paper explains why. It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot. It is then shown how to extend these results to networks with hidden layers and nonlinear units. Finally the theory is confirmed by some numerical simulations using the data from NetTalk.
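The first effect described above (a weight decay selecting the smallest weight vector that solves the learning problem) can be illustrated with a minimal sketch. It assumes a single-layer linear network trained with squared error, so that weight decay reduces to ridge regression with a closed-form solution. The function name `ridge_weights`, the random data, and the decay value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ridge_weights(X, y, decay):
    """Closed-form minimizer of ||X w - y||^2 + decay * ||w||^2 (assumed linear model)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + decay * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 examples, 8 weights: many exact solutions exist
y = rng.normal(size=5)

# A very small decay barely changes the fit but picks one particular solution.
w_decay = ridge_weights(X, y, decay=1e-6)

# Minimum-norm exact solution, computed independently via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Another exact solution: add a component from the null space of X.
v = rng.normal(size=8)
null_part = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min_norm + null_part

print(np.linalg.norm(w_decay), np.linalg.norm(w_other))
```

In this sketch the decayed solution agrees with the pseudoinverse solution up to a small tolerance, while `w_other` fits the data equally well with a larger norm. The null-space component plays the role of the "irrelevant components" that the decay suppresses.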
Rigorous learning curve bounds from statistical mechanics
Machine Learning, 1994
Cited by 53 (9 self)
In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the well-established Vapnik-Chervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior (functional form) of learning curves. This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory. The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes. We illustrate our results with many concrete examples of learning curve bounds derived from our theory.
1 Introduction
According to the Vapnik-Chervonenkis (VC) theory of learning curves [27, 26], minimizing empirical error within a function class F on a random sample of m examples leads to generalization error bounded by $\tilde{O}(d/m)$ (in the case that the target function is contained in F) or $\tilde{O}(\sqrt{d/m})$ plus the optimal generalization error achievable within F (in the general case). These bounds are universal: they hold for any class of hypothesis functions F, for any input distribution, and for any target function. The only problem-specific quantity remaining in these bounds is the VC dimension d, a measure of the complexity of the function class F. It has been shown that these bounds are essentially the best distribution-independent bounds possible, in the sense that for any function class, there exists an input distribution for which matching lower bounds on the generalization error can be given [5, 7, 22].
Four Types Of Learning Curves
Neural Computation, 1991
Cited by 30 (2 self)
In learning from examples, the generalization error $\varepsilon(t)$ is the average probability that an incorrect decision is made by a machine trained by t examples. The generalization error decreases as t increases, and the curve $\varepsilon(t)$ is called a learning curve. The present paper uses the Bayesian approach to show that, under the random phase approximation (annealed approximation), learning curves are classified into four asymptotic types depending on the situation. When a machine is deterministic with noiseless teacher signals, 1) $\varepsilon \sim a t^{-1}$ when the correct machine parameter is unique in the parameter space, and 2) $\varepsilon \sim a t^{-2}$ when the set of the correct parameters has a finite measure. When teacher signals are noisy, 3) $\varepsilon \sim a t^{-1/2}$ for a deterministic machine, and 4) $\varepsilon \sim c + a t^{-1}$ for a stochastic machine.
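When a learning curve decays as a pure power of the number of examples t, its asymptotic type can be read off as a slope on log-log axes. The following is a minimal sketch on synthetic, noise-free curves. The helper name `fit_power_law_exponent`, the prefactor, and the range of t are illustrative assumptions, and a curve with a constant offset c would need that offset subtracted before fitting.

```python
import numpy as np

def fit_power_law_exponent(t, err):
    """Return p for a curve err ~ a * t**(-p), using a straight-line fit in log-log space."""
    slope, _intercept = np.polyfit(np.log(t), np.log(err), 1)
    return -slope

t = np.arange(10, 1000, dtype=float)

# Synthetic curves with known exponents (assumed prefactor a = 2).
curve_fast = 2.0 * t ** -2.0
curve_slow = 2.0 * t ** -0.5

p_fast = fit_power_law_exponent(t, curve_fast)
p_slow = fit_power_law_exponent(t, curve_slow)
print(p_fast, p_slow)
```

On measured learning curves the fit is only meaningful in the large-t regime, since the classification concerns asymptotic behavior.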
Annealed Theories of Learning
In J.H, 1995
Cited by 9 (1 self)
We study annealed theories of learning Boolean functions using a concept class of finite cardinality. The naive annealed theory can be used to derive a universal learning curve bound for zero temperature learning, similar to the inverse square root bound from the Vapnik-Chervonenkis theory. Tighter, nonuniversal learning curve bounds are also derived. A more refined annealed theory leads to still tighter bounds, which in some cases are very similar to results previously obtained using one-step replica symmetry breaking.
1. Introduction
The annealed approximation [1] has proven to be an invaluable tool for studying the statistical mechanics of learning from examples. Previously it was found that the annealed approximation gave qualitatively correct results for several models of perceptrons learning realizable rules [2]. Because of its simplicity relative to the full quenched theory, the annealed approximation has since been used in studies of more complicated multilayer architectures. ...
Generativity and Systematicity in Neural Network Combinatorial Learning
1993
Cited by 9 (0 self)
This thesis addresses a set of problems faced by connectionist learning that have originated from the observation that connectionist cognitive models lack two fundamental properties of the mind: generativity, stemming from the boundless cognitive competence one can exhibit, and systematicity, due to the existence of symmetries within them. Such properties have seldom been seen in neural network models, which have typically suffered from problems of inadequate generalization, as exemplified both by the small number of generalizations relative to training set sizes and by heavy interference between newly learned items and previously learned information. Symbolic theories, arguing that mental representations have syntactic and semantic structure built from structured combinations of symbolic constituents, can in principle account for these properties (both arise from the sensitivity of structured semantic content with a generative and systematic syntax). This thesis studies the question of whe...
Perturbation response in feedforward networks
Neural Networks, 1994
Cited by 5 (0 self)
Feedforward neural networks with continuous-valued activation functions have recently emerged as a powerful paradigm for modeling nonlinear systems. Several classes of such networks have been proved to possess universal approximation capabilities. Prominent among the advantages claimed for such networks are robustness and distributedness of processing and representation. However, there has been little direct research on either issue, particularly the former, and these characteristics of neural networks have been accepted mostly on faith, or on the basis of heuristic arguments. In this paper, we attempt to construct a framework within which these very important issues can be addressed in a coherent and tractable manner. The focus of the paper is on a particularly simple, but instructive, problem: to predict the effect of perturbations in internal neuron outputs on the performance of the network as a whole. This is directly useful in three ways: 1) it gives information about the network's tolerance of internal perturbations; 2) it can be used as a criterion for selecting among multiple network solutions to a given modeling problem; and 3) it provides a framework for relating the performance of a network to the performance of its components. Of these, the third is especially attractive because it can be used as the basis for a theory of distributed representation and processing in feedforward networks.
Characterizing Rational versus Exponential Learning Curves
In Computational Learning Theory: Second European Conference, EuroCOLT'95, 1995
Cited by 5 (0 self)
We consider the standard problem of learning a concept from random examples. Here a learning curve can be defined to be the expected error of a learner's hypotheses as a function of training sample size. Haussler, Littlestone and Warmuth have shown that, in the distribution-free setting, the smallest expected error a learner can achieve in the worst case over a concept class C converges rationally to zero error (i.e., $\Theta(1/t)$ for training sample size t). However, recently Cohn and Tesauro have demonstrated how exponential convergence can often be observed in experimental settings (i.e., average error decreasing as $e^{-\Theta(t)}$). By addressing a simple nonuniformity in the original analysis, this paper shows how the dichotomy between rational and exponential worst-case learning curves can be recovered in the distribution-free theory. These results support the experimental findings of Cohn and Tesauro: for finite concept classes, any consistent learner achieves exponent...
An Algorithmic Method to Build Good Training Sets for Neural-Network Classifiers
In Proceedings of ICNN'94, IEEE International Conference on Neural Networks, 1994
Cited by 4 (0 self)
s are available from the same host in the directory /pub/TR/UBLCS/ABSTRACTS in plain text format. All local authors can be reached via email at the address lastname@cs.unibo.it.
UBLCS Technical Report Series
93-19 HERMES: an Expert System for the Prognosis of Hepatic Diseases, I. Bonfà, C. Maioli, F. Sarti, G.L. Milandri, P.R. Dal Monte, September 1993.
93-20 An Information Flow Security Property for CCS, R. Focardi, R. Gorrieri, October 1993.
93-21 A Classification of Security Properties, R. Focardi, R. Gorrieri, October 1993.
93-22 Real Time Systems: A Tutorial, F. Panzieri, R. Davoli, October 1993.
93-23 A Scalable Architecture for Reliable Distributed Multimedia Applications, F. Panzieri, M. Roccetti, October 1993.
93-24 Wide-Area Distribution Issues in Hypertext Systems, C. Maioli, S. Sola, F. Vitali, October 1993.
93-25 On Relating Some Models for Concurrency, P. Degano, R. Gorrieri, S. Vigna, October 1993.
93-26 Axiomatising ST Bisimulation Equivalence, N. Busi, R. van Glabb...
Classification of all Stationary Points on a Neural Network Error Surface
1994
Cited by 4 (3 self)
The artificial neural network with one hidden unit and the input nodes connected to the output node is considered. It is proven that the error surface of this network for the patterns of the XOR problem has minimum values with zero error and that all other stationary points of the error surface are saddle points. Also, the volume of the regions in weight space with saddle points is zero; hence, when this network is trained on the four patterns of the XOR problem, using e.g. backpropagation with momentum, the correct solution with zero error will be reached in the limit with probability one.
1 Introduction
A central theme in neural network research is to find the right network (architecture and learning algorithm) for a problem. Some learning algorithms also influence the architecture (pruning and construction, see e.g. [5, 7]). In our research [1, 2, 3] we are trying to generate good architectures for neural networks using a genetic algorithm which works on strings containing coded productio...
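The architecture described in this abstract (one hidden unit, with both inputs also connected directly to the output) can be sketched as follows. The weights, the `gain` parameter, and the function names are hand-picked for illustration and are assumptions, not values from the paper. Sigmoid units are also an assumption, and with them the error approaches zero only as the gain grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2, gain=20.0):
    """Forward pass of a 2-1-1 network with direct input-to-output connections."""
    # Hidden unit: fires only when both inputs are on (approximate AND).
    h = sigmoid(gain * (x1 + x2 - 1.5))
    # Output unit: sums both inputs directly and is inhibited by the hidden unit.
    return sigmoid(gain * (x1 + x2 - 2.0 * h - 0.5))

# The four XOR patterns as (x1, x2, target).
patterns = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

outputs = [xor_net(a, b) for a, b, _ in patterns]
sse = sum((out - target) ** 2 for out, (_, _, target) in zip(outputs, patterns))
print(outputs, sse)
```

The sketch only shows that a near-zero-error weight setting exists for this architecture. It says nothing about the stationary points of the error surface, which is the subject of the paper's proof.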