Automatic Learning Rate Maximization by On-Line Estimation of the Hessian's Eigenvectors
Advances in Neural Information Processing Systems, 1993
Cited by 23 (2 self)
We propose a very simple and well-principled way of computing the optimal step size in gradient descent algorithms. The on-line version is very efficient computationally, and is applicable to large backpropagation networks trained on large data sets. The main ingredient is a technique for estimating the principal eigenvalue(s) and eigenvector(s) of the objective function's second derivative matrix (Hessian), which does not even require calculating the Hessian. Several other applications of this technique are proposed for speeding up learning, or for eliminating useless parameters.
1 INTRODUCTION
Choosing the appropriate learning rate, or step size, in a gradient descent procedure such as backpropagation, is simultaneously one of the most crucial and expert-intensive parts of neural-network learning. We propose a method for computing the best step size which is well-principled, simple, very cheap computationally, and, most of all, applicable to on-line training with large ne...
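The idea of estimating the Hessian's principal eigenvalue without forming the Hessian can be illustrated with a small batch sketch. This is not the paper's on-line procedure: it assumes a toy quadratic objective, approximates Hessian-vector products by finite differences of gradients, and takes the reciprocal of the largest eigenvalue as the step size. The function names, the toy objective, and that step-size rule are all assumptions made for illustration.

```python
import math

def grad(w):
    # Gradient of the toy objective E(w) = 0.5 * (4*w0^2 + w1^2).
    # Its Hessian is diag(4, 1), so the principal eigenvalue is 4.
    return [4.0 * w[0], 1.0 * w[1]]

def principal_eigenvalue(grad_fn, w, iters=50, eps=1e-4):
    """Power iteration using only gradient evaluations (hypothetical helper).

    The product H u is approximated by (grad(w + eps*u) - grad(w)) / eps,
    so the Hessian itself is never built.
    """
    v = [1.0] * len(w)
    g0 = grad_fn(w)
    lam = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]                      # unit direction
        g1 = grad_fn([wi + eps * ui for wi, ui in zip(w, u)])
        v = [(a - b) / eps for a, b in zip(g1, g0)]    # approximates H u
        lam = math.sqrt(sum(x * x for x in v))         # eigenvalue estimate
    return lam

lam_max = principal_eigenvalue(grad, [0.3, -0.7])
step_size = 1.0 / lam_max   # assumed rule: step size = 1 / largest eigenvalue
```

On this toy problem the estimate approaches 4, giving a step size close to 0.25.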
Sammon’s mapping using neural networks: a comparison
Pattern Recognition Letters, 1997
Cited by 20 (1 self)
A well-known procedure for mapping data from a high-dimensional space onto a lower-dimensional one is Sammon's mapping. This algorithm preserves as well as possible all inter-pattern distances. A major disadvantage of the original algorithm lies in the fact that it is not easy to map hitherto unseen points. To overcome this problem, several methods have been proposed. In this paper, we aim to compare some approaches to implement this mapping on a neural network.
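How well a mapping preserves inter-pattern distances is usually measured with Sammon's stress. The sketch below computes that quantity for a given pair of high- and low-dimensional point sets. It assumes Euclidean distances and distinct input points, and the function name and sample data are hypothetical. It only evaluates the stress; it does not perform the mapping or any of the neural-network variants the paper compares.

```python
import math
from itertools import combinations

def sammon_stress(high, low):
    """Sammon's stress between original points and their low-dimensional images.

    stress = (1 / sum d*_ij) * sum (d*_ij - d_ij)^2 / d*_ij over all pairs i < j,
    where d* is the distance in the original space and d in the mapped space.
    Assumes no two original points coincide (d*_ij > 0).
    """
    weighted_error = 0.0
    total_distance = 0.0
    for i, j in combinations(range(len(high)), 2):
        d_star = math.dist(high[i], high[j])
        d = math.dist(low[i], low[j])
        weighted_error += (d_star - d) ** 2 / d_star
        total_distance += d_star
    return weighted_error / total_distance

# Three points that lie in the plane z = 0 of a 3-D space.
points_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# Dropping z keeps every distance intact; keeping only x does not.
stress_perfect = sammon_stress(points_3d, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
stress_flat = sammon_stress(points_3d, [(0.0,), (1.0,), (0.0,)])
```

A stress of zero means all pairwise distances are preserved exactly; larger values indicate more distortion.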
Online learning and stochastic approximations
In Online Learning in Neural Networks, 1998
Cited by 20 (0 self)
The convergence of online learning algorithms is analyzed using the tools of the stochastic approximation theory, and proved under very weak conditions. A general framework for online learning algorithms is first presented. This framework encompasses the most common online learning algorithms in use today, as illustrated by several examples. The stochastic approximation theory then provides general results describing the convergence of all these learning algorithms at once.
The Cascade-Correlation Learning Architecture
Advances in Neural Information Processing Systems 2, 1990
Cited by 18 (0 self)
Cascade-Correlation is a new architecture and supervised learning algorithm for artificial neural networks. Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure. Once a new hidden unit has been added to the network, its input-side weights are frozen. This unit then becomes a permanent feature-detector in the network, available for producing outputs or for creating other, more complex feature detectors. The Cascade-Correlation architecture has several advantages over existing algorithms: it learns very quickly, the network determines its own size and topology, it retains the structures it has built even if the training set changes, and it requires no back-propagation of error signals through the connections of the network. This research was sponsored in part by the National Science Foundation under Contract Number EET87163...
Transforming Neural-Net Output Levels to Probability Distributions
Advances in Neural Information Processing Systems 3, 1991
Cited by 17 (0 self)
(1) The outputs of a typical multi-output classification network do not satisfy the axioms of probability; probabilities should be positive and sum to one. This problem can be solved by treating the trained network as a preprocessor that produces a feature vector that can be further processed, for instance by classical statistical estimation techniques. (2) We find that in cases of interest, neural networks are (and should be) somewhat underdetermined because the training data is always limited in quality and quantity. We present a method for computing the first two moments of the probability distribution indicating the range of outputs that are consistent with the input and the training data. It is particularly useful to combine these two ideas: we implement the ideas of section 1 using Parzen windows, where the shape and relative size of each window is computed using the ideas of section 2. This allows us to make contact between important theoretical ideas (e.g. the ensemble form...
Constructive Learning Techniques for Designing Neural Network Systems
1997
Cited by 13 (0 self)
Contents 1. Introduction. 2. Classification. 2.1 Introduction. 2.2 The Pocket algorithm. 2.3 Tower and Cascade architectures. 2.4 Tree architectures: the Upstart algorithm. 2.5 Constructing tree and cascade architectures using dichotomies. 2.6 Constructing neural networks with a single hidden layer. 2.7 Summary. 3. Regression. 3.1 Introduction. 3.2 The Cascade Correlation Algorithm. 3.3 Node creation and node splitting algorithms. 3.4 Constructing RBF networks. 3.5 Summary. 4. Constructing Modular Architectures. 4.1 Introduction. 4.2 Neural Decision Trees. 4.3 Other approaches to constructing modular networks. 5. Reducing Network Complexity. 5.1 Introduction. 5.2 Pruning Procedures. 5.3 Summary. 6. Conclusion. 7. Appendix: algorithms for single-node learning.
1 Introduction
Neural networks have been applied to a wide range of application domains such as control, telecommun
Nonmonotone Methods for Backpropagation Training with Adaptive Learning Rate
In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN'99), Washington D.C., 1999
Cited by 13 (9 self)
In this paper, we present nonmonotone methods for feedforward neural network training, i.e. training methods in which error function values are allowed to increase at some iterations. More specifically, at each epoch we impose that the current error function value must satisfy an Armijo-type criterion, with respect to the maximum error function value of M previous epochs. A strategy to dynamically adapt M is suggested and two training algorithms with adaptive learning rates that successfully employ the above mentioned acceptability criterion are proposed. Experimental results show that the nonmonotone learning strategy improves the convergence speed and the success rate of the methods considered.
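The acceptance test described in this abstract can be sketched as follows. This is a minimal illustration and not either of the paper's two algorithms: it assumes plain gradient descent on a toy quadratic, a fixed window M, step halving when the test fails, and an Armijo constant sigma. Every name, constant, and the toy error function are hypothetical. The paper's strategy for adapting M dynamically is not reproduced.

```python
def nonmonotone_descent(f, grad_f, w, lr=0.105, M=3, sigma=1e-4, epochs=50):
    """Gradient descent with a nonmonotone Armijo-type acceptance test.

    A trial step is accepted when the new error is sufficiently below the
    maximum error of the last M epochs, so the error may rise between
    consecutive epochs.
    """
    history = [f(w)]
    for _ in range(epochs):
        g = grad_f(w)
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:
            break
        reference = max(history[-M:])       # worst error among the last M epochs
        step = lr
        for _halving in range(60):          # safety cap on halvings (an assumption)
            w_new = [wi - step * gi for wi, gi in zip(w, g)]
            if f(w_new) <= reference - sigma * step * g_sq:
                break                       # Armijo-type test against the reference
            step *= 0.5                     # shrink the learning rate and retry
        w = w_new
        history.append(f(w))
    return w, history

def error(w):
    # Toy ill-conditioned quadratic standing in for a network error function.
    return w[0] ** 2 + 10.0 * w[1] ** 2

def error_grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

weights, history = nonmonotone_descent(error, error_grad, [1.0, 1.0])
```

With these settings the learning rate is slightly too large for the steep direction, so a strictly monotone rule would halve the step more often. The windowed test tolerates an occasional rise in error while still bounding each value by the recent maximum.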
Shared Weights Neural Networks in Image Analysis
1996
Cited by 12 (6 self)
This thesis is concerned with the use of shared weights neural networks in image analysis. This type of neural network has been extensively described in the literature since 1989. It is believed that networks incorporating shared weights are capable of local, shift-invariant feature extraction due to the restrictions placed on their architecture. The first experiments were focused mainly on the neural network architectures as suggested by, amongst others, Le Cun et al. [LBD+89, LBD+90, LJB+89] and Viennet [Vie93]. These architectures basically are backpropagation neural networks. However, they restrain the number of free parameters and introduce the notion of receptive fields, combining local information into more abstract patterns at a higher level. Three of these networks were tested on the problem of handwritten digit recognition and the results were compared to those of methods based on other feature extraction or classification techniques. As an intermezzo, a second order...
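Why weight sharing gives local, shift-tolerant feature extraction can be seen in a one-dimensional toy case. The sketch below is not taken from the thesis. It assumes a single linear feature map with a two-weight receptive field and no nonlinearity, and the function name and signals are hypothetical.

```python
def shared_weight_layer(signal, kernel):
    """Apply the same small weight vector at every position of the input.

    Each output unit sees only a local receptive field of len(kernel) inputs,
    and all units share the same weights.
    """
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1.0, -1.0]                      # two free parameters, reused everywhere
signal = [0, 0, 1, 1, 0, 0, 0]
shifted_signal = [0, 0, 0, 1, 1, 0, 0]    # same pattern moved one step right

feature_map = shared_weight_layer(signal, kernel)
shifted_map = shared_weight_layer(shifted_signal, kernel)
```

Shifting the input by one position shifts the feature map by one position, and a position-independent summary such as the maximum response stays the same. The layer has only two free parameters regardless of input length.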
An EM Approach to Learning Sequential Behavior
1994
Cited by 11 (5 self)
We consider problems of sequence processing and we propose a solution based on a discrete state model. We introduce a recurrent architecture having a modular structure that allocates subnetworks to discrete states. Different subnetworks model the dynamics (state transition) and the output of the model, conditional on the previous state and an external input. The model has a statistical interpretation and can be trained by the EM or GEM algorithms, considering state trajectories as missing data. This makes it possible to decouple temporal credit assignment from actual parameter estimation. The model presents similarities to hidden Markov models, but allows input sequences to be mapped to output sequences, using the same processing style as recurrent networks. For this reason we call it Input/Output HMM (IOHMM). Another remarkable difference is that IOHMMs are trained using a supervised learning paradigm (while potentially taking advantage of the EM algorithm), whereas standard HMMs are trained by an...
Variance Analysis of Sensitivity Information for Pruning Multilayer Feedforward Neural Networks
1999
Cited by 9 (2 self)
This paper presents an algorithm to prune feedforward neural network architectures using sensitivity analysis. Sensitivity analysis is used to quantify the relevance of input and hidden units. A new statistical pruning heuristic is proposed, based on variance analysis, to decide which units to prune. Results are presented to show that the pruning algorithm correctly prunes irrelevant input and hidden units.