Results 1 - 7 of 7
Fast pruning using principal components
Advances in Neural Information Processing Systems, Morgan Kaufmann
Abstract (Cited by 28, 4 self):
We present a new algorithm for eliminating excess parameters and improving network generalization after supervised training. The method, "Principal Components Pruning (PCP)", is based on principal component analysis of the node activations of successive layers of the network. It is simple, cheap to implement, and effective. It requires no network retraining, and does not involve calculating the full Hessian of the cost function. Only the weight and the node activity correlation matrices for each layer of nodes are required. We demonstrate the efficacy of the method on a regression problem using polynomial basis functions, and on an economic time series prediction problem using a two-layer, feedforward network.
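The core of PCP can be sketched in a few lines: estimate a layer's node-activity correlation matrix, keep only its leading principal components, and project the layer's outgoing weights onto that subspace. The sketch below illustrates the idea under assumed conventions; the `energy` threshold, function name, and matrix shapes are not from the paper:

```python
import numpy as np

def pca_prune_layer(activations, weights, energy=0.99):
    """Project a layer's outgoing weights onto the leading principal
    components of its node activations (a sketch of the PCP idea).

    activations: (n_samples, n_nodes) recorded activations of the layer
    weights:     (n_nodes, n_outputs) weights feeding the next layer
    energy:      fraction of activation variance to keep (assumed knob)
    """
    # Node-activity correlation matrix, as named in the abstract.
    C = activations.T @ activations / len(activations)
    eigvals, eigvecs = np.linalg.eigh(C)           # ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Smallest number of components covering `energy` of the variance.
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cum, energy)) + 1, len(eigvals))
    P = eigvecs[:, :k] @ eigvecs[:, :k].T          # rank-k projector
    return P @ weights, k                          # effectively pruned weights
```

With `energy=1.0` the projector is the identity and the weights are unchanged; lowering it discards the low-variance activation directions, which is what removes the excess parameters.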
Computing Second Derivatives in Feed-Forward Networks: A Review
IEEE Transactions on Neural Networks, 1994
Abstract (Cited by 27, 4 self):
The calculation of second derivatives is required by recent training and analysis techniques for connectionist networks, such as the elimination of superfluous weights, and the estimation of confidence intervals both for weights and network outputs. We here review and develop exact and approximate algorithms for calculating second derivatives. For networks with |w| weights, simply writing the full matrix of second derivatives requires O(|w|^2) operations. For networks of radial basis units or sigmoid units, exact calculation of the necessary intermediate terms requires on the order of 2h + 2 backward/forward propagation passes, where h is the number of hidden units in the network. We also review and compare three approximations (ignoring some components of the second derivative, numerical differentiation, and scoring). Our algorithms apply to arbitrary activation functions, networks, and error functions (for instance, with connections that skip layers, or radial basis functions, or ...
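Of the three approximations the review compares, numerical differentiation is the simplest to write down: build the Hessian column by column from central differences of the gradient. A minimal sketch, where the function name and `eps` are assumptions rather than the paper's notation:

```python
import numpy as np

def hessian_fd(grad_fn, w, eps=1e-5):
    """Approximate the Hessian of a cost function at w by central finite
    differences of its gradient. Costs 2*|w| gradient evaluations to
    fill the |w| x |w| matrix of second derivatives.
    """
    n = len(w)
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        # Column i: d(grad)/dw_i by a symmetric difference.
        H[:, i] = (grad_fn(w + e) - grad_fn(w - e)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize to cancel rounding asymmetry
```

For a quadratic cost (1/2) w^T A w with gradient A w, the recovered matrix is A up to rounding, which makes the sketch easy to sanity-check.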
Speech Processing with Linear and Neural Network Models
1996
Abstract (Cited by 4, 0 self):
... ion, for imposing continuity between models of adjacent speech segments, and learning rate adaptation, for improving backpropagation training, are discussed. For synthesising real speech utterances, an audio tape demonstrates that ARX models produce the highest quality synthetic speech and that the quality is maintained when pitch modifications are applied. The second part of the dissertation studies the operation of recurrent neural networks in classifying patterns of correlated feature vectors. Such patterns are typical of speech classification tasks. The operation of a hidden node with a recurrent connection is explained in terms of a decision boundary which changes position in feature space. The feedback is shown to delay switching from one class to another and to smooth output decisions for sequences of feature vectors from the same class. For networks trained with constant class targets, a sequence of feature vectors from the same class tends to drive the operation of hidden nod...
Does a meeting in Santa Fe imply Chaos?
1994
Abstract (Cited by 3, 1 self):
This contribution compares the success of several nonlinear prediction techniques applied to the data series in sets a.dat and a.cont. The advantages of a new approach making predictions based on selective use of several different delay reconstructions are illustrated, and a comparison of both local linear and local nonlinear predictions is given. Given the limitations due to sampling rate and saturation in these data sets, the quality of the predictions achieved with very little information on the value of the initial condition (32 bits or less), in combination with the examination of the behavior of the system in the longer data set a.cont, suggests that, while the system is nonlinear, evidence for sensitivity to initial conditions, if any, is slight. To appear in: Predicting the Future and Understanding the Past: A Comparison of Approaches, The Proceedings of the Comparative Time Series Analysis Workshop, Santa Fe, May 1992. Ed. by A. Weigend and N. Gershenfeld, Addison-Wesley, 1993....
Optimal Stopping and Effective Machine Complexity in Learning
Advances in Neural Information Processing Systems 6, 1994
Abstract:
We study the problem of when to stop learning for a class of feedforward networks (networks with linear output neurons and fixed input weights) when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown that there are in general three distinct phases in the generalization performance over the course of learning, and in particular, the network has better generalization performance when learning is stopped at a certain time before the global minimum of the empirical error is reached. A notion of effective size of a machine is defined and used to explain the tradeoff between the complexity of the machine and the training error in the learning process. The study leads naturally to a network size selection criterion, which turns out to be a generalization of Akaike's Information Criterion for the learning process. It is shown that stopping learning before the global minimum of the empirical error has the effect of ne...
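The paper's contribution is theoretical, but the practical recipe it supports is the familiar validation-based one: track held-out error during gradient descent and keep the weights from the best validation point, rather than descending to the empirical-error minimum. A sketch for a linear model; the parameter names `lr`, `patience`, and `max_iter` are assumptions, not the paper's notation:

```python
import numpy as np

def gd_early_stop(X_tr, y_tr, X_val, y_val, lr=0.01, patience=10, max_iter=5000):
    """Gradient descent on squared error for a linear model, stopped when
    validation error has not improved for `patience` consecutive steps
    (a common practical reading of stopping before the training-error
    minimum). Returns the best-validation weights and their error.
    """
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, stale = w.copy(), np.inf, 0
    for _ in range(max_iter):
        grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w -= lr * grad
        val = np.mean((X_val @ w - y_val) ** 2)
        if val < best_val:
            best_w, best_val, stale = w.copy(), val, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation error stopped improving
    return best_w, best_val
```

Returning the best-validation weights rather than the final ones is what realizes the "stop at a certain time before the global minimum of the empirical error" behavior the abstract describes.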
Bayes
1998
Abstract:
Variable selection for a multiple regression model (Noisy Linear Perceptron) is studied with a mean field approximation. In our Bayesian framework, variable selection is formulated as estimation of discrete parameters that indicate a subset of the explanatory variables. Then, a mean field approximation is introduced for the calculation of the posterior averages over the discrete parameters. An application to a real-world example, the Boston housing data, is shown.
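The indicator-variable formulation is easy to state concretely. The paper's mean field approximation is what makes the posterior averages tractable for many variables; the sketch below instead enumerates all 2^d subsets exactly, which only works for tiny d. The Gaussian priors, `sigma2`, and `prior_inc` are assumptions for illustration, not the paper's setup:

```python
import numpy as np
from itertools import product

def subset_posteriors(X, y, sigma2=1.0, prior_inc=0.5):
    """Exact posterior over inclusion indicators s in {0,1}^d for a noisy
    linear model y = X[:, s] b + noise, with b ~ N(0, I) on included
    coefficients and independent Bernoulli(prior_inc) inclusion priors.
    Enumerates all 2^d subsets, so it is feasible only for small d.
    """
    n, d = X.shape
    log_post = {}
    for s in product([0, 1], repeat=d):
        idx = [j for j in range(d) if s[j]]
        Xs = X[:, idx]
        # Marginal likelihood: y ~ N(0, Xs Xs^T + sigma2 I) after
        # integrating out the included coefficients b.
        C = Xs @ Xs.T + sigma2 * np.eye(n)
        sign, logdet = np.linalg.slogdet(C)
        ll = -0.5 * (logdet + y @ np.linalg.solve(C, y))
        lp = sum(np.log(prior_inc) if b else np.log(1 - prior_inc) for b in s)
        log_post[s] = ll + lp
    m = max(log_post.values())
    Z = sum(np.exp(v - m) for v in log_post.values())
    return {s: np.exp(v - m) / Z for s, v in log_post.items()}
```

Summing this posterior over all subsets containing variable j gives its marginal inclusion probability, which is the kind of posterior average the mean field method approximates when enumeration is impossible.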
The Comparison of Deterministic and Stochastic VTG Schemes in the Application of MLP to Time Series Prediction
Abstract:
Time series prediction is the process of forecasting a future measurement by analyzing the patterns, trends, and relations among past measurements and the current measurement. Neural approaches to time series prediction require a sufficiently long history of measurements to generate enough training patterns. The more training patterns there are, the better an MLP generalizes. Research on schemes that generate artificial training patterns and add them to the original ones has progressed, and motivated the development of the VTG schemes in 1996. In this paper, stochastic VTG schemes are proposed and compared with the deterministic VTG schemes proposed in 1996.