Results 1 – 8 of 8
Efficient BackProp
, 1998
Abstract

Cited by 125 (24 self)
The convergence of backpropagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
1 Introduction
Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science. Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number ...
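One of the best-known tricks from this line of work is normalizing the inputs: shifting each input variable to zero mean and rescaling to unit variance before training, which conditions the error surface for gradient descent. A minimal plain-Python sketch of that preprocessing step (the function name and toy data are illustrative, not from the paper):

```python
# Sketch of the input-normalization trick: rescale each input column
# to zero mean and unit variance before feeding it to the network.

def normalize_columns(rows):
    """Return rows rescaled so each column has zero mean and unit variance."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / n
        stds.append(var ** 0.5 or 1.0)  # guard against zero-variance columns
    return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]

# Toy data: two features on very different scales.
data = [[10.0, 0.1], [20.0, 0.2], [30.0, 0.3]]
normed = normalize_columns(data)
```

After normalization both features contribute on a comparable scale, so a single global learning rate is less likely to be too large for one weight and too small for another.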
Static Versus Dynamic Sampling for Data Mining
 In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining
, 1996
Abstract

Cited by 71 (0 self)
As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the data-mining process, but this is only acceptable if it does not reduce the quality of the mined knowledge. To this end, we introduce the "Probably Close Enough" criterion to describe the desired properties of a sample. Sampling usually refers to the use of static statistical tests to decide whether a sample is sufficiently similar to the large database, in the absence of any knowledge of the tools the data miner intends to use. We discuss dynamic sampling methods, which take into account the mining tool being used and can thus give better samples. We describe dynamic schemes that observe a mining tool's performance on training samples of increasing size and use these results to determine when a sample is sufficiently large. We evaluate these sampl...
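The dynamic schemes described above can be sketched as a simple loop: grow the sample, evaluate the mining tool on it, and stop once the gain falls below a threshold. The learner, growth factor, and threshold below are placeholders, not the paper's actual algorithm:

```python
# Sketch of dynamic (progressive) sampling: grow the training sample
# and stop once the mining tool's score stops improving meaningfully.

def progressive_sample(data, evaluate, start=100, growth=2.0, tol=0.01):
    """Grow the sample by `growth` until evaluate(sample) gains less than tol.

    `evaluate` stands in for running the mining tool on a sample and
    measuring its quality (e.g. held-out accuracy).
    """
    size = start
    prev_score = evaluate(data[:size])
    while size < len(data):
        size = min(int(size * growth), len(data))
        score = evaluate(data[:size])
        if score - prev_score < tol:   # "probably close enough"
            return size, score
        prev_score = score
    return len(data), prev_score

# Toy stand-in for a mining tool whose quality saturates with sample size.
data = list(range(1000))
evaluate = lambda sample: 1.0 - 1.0 / len(sample)
size, score = progressive_sample(data, evaluate)
```

The stopping rule is the key design choice: a static scheme would fix the sample size in advance, while this loop lets the observed learning behavior of the tool decide when the sample is large enough.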
Fast Exact Multiplication by the Hessian
 Neural Computation
, 1994
Abstract

Cited by 69 (4 self)
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr)f(w + rv)_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.
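The paper's R{.}-operator procedure computes Hv exactly. Purely as a numerical illustration of the quantity Hv (not the paper's method), the product can also be approximated by a central difference of the gradient along v; for a quadratic E(w) = ½ wᵀAw, where grad E = Aw and Hv = Av, the approximation can be checked against the exact answer:

```python
# Illustration only: Hv approximated by a central difference of the
# gradient along v,  Hv ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
# The paper derives an *exact* procedure via R{f(w)} = (d/dr) f(w + r*v)|_{r=0};
# here we just check the quantity on a toy quadratic E(w) = 0.5 * w^T A w.

A = [[2.0, 1.0],
     [1.0, 3.0]]          # symmetric Hessian of the toy quadratic

def grad(w):
    """Gradient of E(w) = 0.5 w^T A w, i.e. A w."""
    return [sum(A[i][j] * w[j] for j in range(2)) for i in range(2)]

def hess_vec(w, v, eps=1e-5):
    """Central-difference approximation to H v at w."""
    gp = grad([wi + eps * vi for wi, vi in zip(w, v)])
    gm = grad([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2 * eps) for p, m in zip(gp, gm)]

w, v = [0.5, -1.0], [1.0, 2.0]
exact = grad(v)            # for a quadratic, H v = A v
approx = hess_vec(w, v)
```

Note the cost structure the abstract emphasizes: both the exact R-operator procedure and this finite-difference stand-in need only gradient-sized computations, never the full n×n matrix H.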
Automatic Learning Rate Maximization by On-Line Estimation of the Hessian's Eigenvectors
 Advances in Neural Information Processing Systems
, 1993
Abstract

Cited by 23 (2 self)
We propose a very simple, and well-principled, way of computing the optimal step size in gradient descent algorithms. The online version is very efficient computationally, and is applicable to large backpropagation networks trained on large data sets. The main ingredient is a technique for estimating the principal eigenvalue(s) and eigenvector(s) of the objective function's second derivative matrix (Hessian), which does not even require calculating the Hessian. Several other applications of this technique are proposed for speeding up learning, or for eliminating useless parameters.
1 INTRODUCTION
Choosing the appropriate learning rate, or step size, in a gradient descent procedure such as backpropagation is simultaneously one of the most crucial and expert-intensive parts of neural-network learning. We propose a method for computing the best step size which is well-principled, simple, very cheap computationally, and, most of all, applicable to online training with large ne...
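The core idea can be sketched with power iteration: estimate the largest Hessian eigenvalue using only Hessian-vector products, then set the step size to roughly 1/λ_max. The toy fixed matrix below is an assumption for illustration; the paper estimates these quantities on-line during training, without ever forming H:

```python
# Sketch: estimate the principal Hessian eigenvalue by power iteration
# using only matrix-vector products, then derive a step size ~ 1/lambda_max.
# H here is a small fixed stand-in; the paper works with the implicit
# Hessian of a network's objective, estimated on-line.

H = [[3.0, 1.0],
     [1.0, 2.0]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def principal_eigenvalue(matvec_fn, dim, iters=100):
    """Power iteration: repeatedly apply H and renormalize."""
    v = [1.0] * dim
    for _ in range(iters):
        hv = matvec_fn(v)
        norm = sum(x * x for x in hv) ** 0.5
        v = [x / norm for x in hv]
    hv = matvec_fn(v)
    return sum(vi * hi for vi, hi in zip(v, hv))   # Rayleigh quotient

lam = principal_eigenvalue(lambda v: matvec(H, v), 2)
step_size = 1.0 / lam
```

The connection to the previous entry is direct: each `matvec_fn` call can be realized as a fast exact Hessian-vector product, so the eigenvalue estimate never needs H itself.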
Selected Training Exemplars for Neural Network Learning
, 1994
Abstract

Cited by 9 (0 self)
The dissertation of Mark Plutowski is approved, and it is acceptable in quality and form for publication on microfilm: Co-Chair, Co-Chair
An Investigation of the Gradient Descent Process in Neural Networks
, 1996
Abstract
Usually gradient descent is merely a way to find a minimum, to be abandoned if a more efficient technique is available. Here we investigate the detailed properties of the gradient descent process, and the related topics of how gradients can be computed, what the limitations on gradient descent are, and how the second-order information that governs the dynamics of gradient descent can be probed. To develop our intuitions, gradient descent is applied to a simple robot arm dynamics compensation problem, using backpropagation on a temporal-windows architecture. The results suggest that smooth filters can be easily learned, but that the deterministic gradient descent process can be slow and can exhibit oscillations. Algorithms to compute the gradient of recurrent networks are then surveyed in a general framework, leading to some unifications, a deeper understanding of recurrent networks, and some algorithmic extensions. By regarding deterministic gradient descent as a dynamic system, we obtain results concerning its convergence, and a quantitative theory of its behavior.
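The oscillation the abstract mentions is easy to reproduce on a toy problem: on an ill-conditioned quadratic, a step size tuned near the stability limit of the steepest direction makes that coordinate overshoot and flip sign on every step. The specific curvatures and step size below are illustrative assumptions:

```python
# Illustration of gradient-descent oscillation on an ill-conditioned
# quadratic E(w) = 0.5 * (a*w1^2 + b*w2^2). The w2 update is
# w2 <- (1 - eta*b) * w2, so w2 flips sign each step when eta*b > 1
# (oscillation) and diverges when eta*b > 2.

a, b = 1.0, 20.0          # curvatures; condition number b / a = 20
eta = 0.09                # eta * b = 1.8: still stable, but oscillating in w2

w = [1.0, 1.0]
trajectory_w2 = []
for _ in range(6):
    g = [a * w[0], b * w[1]]
    w = [wi - eta * gi for wi, gi in zip(w, g)]
    trajectory_w2.append(w[1])
```

The w2 coordinate alternates sign while shrinking in magnitude, exactly the slow, oscillatory convergence regime: the step size is limited by the largest curvature even though the other direction could tolerate a far bigger step.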