Fast pruning using principal components
- ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, MORGAN-KAUFMANN
Cited by 26 (3 self)
We present a new algorithm for eliminating excess parameters and improving network generalization after supervised training. The method, "Principal Components Pruning (PCP)", is based on principal component analysis of the node activations of successive layers of the network. It is simple, cheap to implement, and effective. It requires no network retraining, and does not involve calculating the full Hessian of the cost function. Only the weight and the node activity correlation matrices for each layer of nodes are required. We demonstrate the efficacy of the method on a regression problem using polynomial basis functions, and on an economic time series prediction problem using a two-layer, feedforward network.
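A minimal sketch of the idea in the abstract above, for a single layer. It uses only the layer's weights and the correlation matrix of the activations feeding it, as the abstract describes. The saliency score (eigenvalue times squared weight norm along the eigenvector), the function name `pcp_prune_layer`, and the `keep` parameter are assumptions made for illustration; the abstract does not give these details.

```python
import numpy as np

def pcp_prune_layer(W, U, keep):
    """Project a layer's weights onto its most salient principal directions.

    W    : (n_out, n_in) weight matrix of the layer.
    U    : (n_samples, n_in) activations feeding the layer.
    keep : number of principal directions to retain (hypothetical parameter).
    """
    # Correlation (second-moment) matrix of the activations feeding the layer.
    sigma = U.T @ U / U.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)  # columns are eigenvectors
    # Assumed saliency: variance along a direction times the squared
    # norm of the weights acting on that direction.
    saliency = eigvals * np.sum((W @ eigvecs) ** 2, axis=0)
    top = np.argsort(saliency)[::-1][:keep]
    C = eigvecs[:, top]
    # Projected weights; no retraining is involved.
    return W @ C @ C.T, saliency
```

When the activations lie in a low-dimensional subspace, the projected weights reproduce the layer's output on those activations while having reduced rank.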
Computing Second Derivatives in Feed-Forward Networks: a Review
- IEEE Transactions on Neural Networks, 1994
Cited by 22 (4 self)
The calculation of second derivatives is required by recent training and analysis techniques of connectionist networks, such as the elimination of superfluous weights, and the estimation of confidence intervals both for weights and network outputs. We here review and develop exact and approximate algorithms for calculating second derivatives. For networks with |w| weights, simply writing the full matrix of second derivatives requires O(|w|^2) operations. For networks of radial basis units or sigmoid units, exact calculation of the necessary intermediate terms requires of the order of 2h + 2 backward/forward-propagation passes where h is the number of hidden units in the network. We also review and compare three approximations (ignoring some components of the second derivative, numerical differentiation, and scoring). Our algorithms apply to arbitrary activation functions, networks, and error functions (for instance, with connections that skip layers, or radial basis functions, or ...
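Of the three approximations named in the abstract above, numerical differentiation is the easiest to sketch. The example below is an illustration under assumptions, not the paper's algorithm: it uses central differences of a user-supplied gradient function, and the names `numerical_hessian`, `grad_fn`, and `eps` are hypothetical.

```python
import numpy as np

def numerical_hessian(grad_fn, w, eps=1e-5):
    """Approximate the matrix of second derivatives by central differences.

    grad_fn : function returning the gradient of the error at a weight vector.
    w       : weight vector at which to evaluate the Hessian.
    eps     : finite-difference step (hypothetical default).
    """
    w = np.asarray(w, dtype=float)
    n = w.size
    H = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        # Column j: change in the gradient when weight j is perturbed.
        H[:, j] = (grad_fn(w + step) - grad_fn(w - step)) / (2 * eps)
    # Symmetrize to remove small asymmetries from rounding.
    return 0.5 * (H + H.T)
```

This costs two gradient evaluations per weight and fills a matrix with one entry per pair of weights, so the storage alone grows quadratically with the number of weights.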
Speech Processing with Linear and Neural Network Models
- 1996
Cited by 4 (0 self)
ion, for imposing continuity between models of adjacent speech segments, and learning rate adaptation, for improving back-propagation training, are discussed. For synthesising real speech utterances, an audio tape demonstrates that ARX models produce the highest quality synthetic speech and that the quality is maintained when pitch modifications are applied. The second part of the dissertation studies the operation of recurrent neural networks in classifying patterns of correlated feature vectors. Such patterns are typical of speech classification tasks. The operation of a hidden node with a recurrent connection is explained in terms of a decision boundary which changes position in feature space. The feedback is shown to delay switching from one class to another and to smooth output decisions for sequences of feature vectors from the same class. For networks trained with constant class targets, a sequence of feature vectors from the same class tends to drive the operation of hidden nod
Does a meeting in Santa Fe imply Chaos?
- in [9, 1994
Cited by 3 (1 self)
This contribution compares the success of several nonlinear prediction techniques applied to the data series in sets a.dat and a.cont. The advantages of a new approach making predictions based on selective use of several different delay reconstructions are illustrated, and a comparison of both local linear and local nonlinear predictions is given. Given the limitations due to sampling rate and saturation in these data sets, the quality of the predictions achieved with very little information on the value of the initial condition (32 bits or less), in combination with the examination of the behavior of the system in the longer data set a.cont, suggests that, while the system is nonlinear, evidence for sensitivity to initial condition, if any, is slight. To appear in: Predicting the Future and Understanding the Past: A Comparison of Approaches, The Proceedings of the Comparative Time Series Analysis Workshop, Santa Fe, May 1992. Ed. by A. Weigend and N. Gershenfeld, Addison-Wesley, 1993....
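The delay reconstruction and local linear prediction mentioned in the abstract above can be sketched generically as follows. This is a textbook-style illustration, not the author's selective multi-reconstruction method; the function names and the parameters `dim`, `tau`, and `k` are assumptions.

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Rows are delay vectors [x[t], x[t+tau], ..., x[t+(dim-1)*tau]]."""
    x = np.asarray(x, dtype=float)
    n = x.size - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def local_linear_predict(x, dim=2, tau=1, k=8):
    """Predict the value following series x from its k nearest delay vectors."""
    x = np.asarray(x, dtype=float)
    V = delay_embed(x, dim, tau)
    # Every library vector has a known successor; the query is the latest state.
    library, query = V[:-1], V[-1]
    targets = x[(dim - 1) * tau + 1 :]
    # Nearest neighbours of the query in delay space.
    nearest = np.argsort(np.linalg.norm(library - query, axis=1))[:k]
    # Fit an affine map from neighbouring delay vectors to their successors.
    A = np.column_stack([library[nearest], np.ones(k)])
    coef, *_ = np.linalg.lstsq(A, targets[nearest], rcond=None)
    return np.append(query, 1.0) @ coef
```

For example, `local_linear_predict(series[:-1])` can be compared against `series[-1]` to score a one-step forecast.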
Optimal Stopping and Effective Machine Complexity in Learning
- Advances in Neural Information Processing Systems 6, 1994
We study the problem of when to stop learning a class of feedforward networks -- networks with linear output neurons and fixed input weights -- when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown that there are in general three distinct phases in the generalization performance in the learning process, and in particular, the network has better generalization performance when learning is stopped at a certain time before the global minimum of the empirical error is reached. A notion of effective size of a machine is defined and used to explain the trade-off between the complexity of the machine and the training error in the learning process. The study leads naturally to a network size selection criterion, which turns out to be a generalization of Akaike's Information Criterion for the learning process. It is shown that stopping learning before the global minimum of the empirical error has the effect of ne...
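The setting in the abstract above (fixed input weights, linear output, gradient descent on a finite sample) can be sketched as follows. The stopping rule here is an assumption: it picks the step with the lowest error on a held-out set, a common practical stand-in, not the analytical criterion the paper derives. The function name and the parameters `lr` and `max_steps` are hypothetical.

```python
import numpy as np

def train_with_early_stopping(H_train, y_train, H_val, y_val,
                              lr=0.01, max_steps=2000):
    """Gradient descent on linear output weights with held-out monitoring.

    H_train, H_val : hidden-layer outputs computed with fixed input weights.
    Returns (best_error, best_step, best_weights) and the validation history.
    """
    w = np.zeros(H_train.shape[1])
    best = (np.inf, 0, w.copy())
    history = []
    for step in range(max_steps + 1):
        # Held-out error serves as a proxy for generalization performance.
        val_err = np.mean((H_val @ w - y_val) ** 2)
        history.append(val_err)
        if val_err < best[0]:
            best = (val_err, step, w.copy())
        # Gradient of the mean squared training error w.r.t. output weights.
        grad = 2 * H_train.T @ (H_train @ w - y_train) / len(y_train)
        w -= lr * grad
    return best, history
```

The returned `best_step` marks where training would have been stopped; comparing it with `max_steps` shows whether stopping before the training-error minimum helped on the held-out data.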

