Results 1–10 of 21
Gradient-based learning applied to document recognition
Proceedings of the IEEE, 1998
Cited by 742 (59 self)
Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two-dimensional (2D) shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN’s), allows such multi-module systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
Optimal Brain Damage
Advances in Neural Information Processing Systems, 1990
Cited by 420 (5 self)
We have used information-theoretic ideas to derive a class of practical and nearly optimal schemes for adapting the size of a neural network. By removing unimportant weights from a network, several improvements can be expected: better generalization, fewer training examples required, and improved speed of learning and/or classification. The basic idea is to use second-derivative information to make a tradeoff between network complexity and training set error. Experiments confirm the usefulness of the methods on a real-world application. 1 INTRODUCTION Most successful applications of neural network learning to real-world problems have been achieved using highly structured networks of rather large size [for example (Waibel, 1989; LeCun et al., 1990)]. As applications become more complex, the networks will presumably become even larger and more structured. Design tools and techniques for comparing different architectures and minimizing the network size will be needed. More impor...
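The pruning rule at the heart of the method can be sketched in a few lines: under a diagonal Hessian approximation, and with the network trained to a local minimum, deleting weight w_k increases the training error by roughly the saliency s_k = h_kk * w_k^2 / 2, so the least salient weights are removed first. The function name and the toy numbers below are illustrative assumptions, not taken from the paper.

```python
def obd_prune(weights, hessian_diag, n_remove):
    """Optimal Brain Damage pruning sketch: rank weights by the saliency
    s_k = h_kk * w_k**2 / 2 (a second-order estimate of the increase in
    training error caused by deleting weight k, assuming a diagonal
    Hessian and a network at a local minimum), then zero out the
    n_remove least salient weights."""
    saliency = [h * w * w / 2.0 for w, h in zip(weights, hessian_diag)]
    order = sorted(range(len(weights)), key=lambda k: saliency[k])
    pruned = list(weights)
    for k in order[:n_remove]:
        pruned[k] = 0.0
    return pruned

# Hypothetical values: a weight can be large in magnitude yet unimportant
# if its Hessian entry is small (here w=2.0 with h=0.01 is pruned first).
w = [2.0, -0.5, 1.0]
h = [0.01, 4.0, 1.0]
print(obd_prune(w, h, n_remove=1))  # -> [0.0, -0.5, 1.0]
```

This illustrates why magnitude-based pruning alone can be misleading: the saliency weighs each weight by the local curvature of the error surface.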
Connectionist Learning Procedures
Artificial Intelligence, 1989
Cited by 339 (6 self)
A major goal of research on networks of neuron-like processing units is to discover efficient learning procedures that allow these networks to construct complex internal representations of their environment. The learning procedures must be capable of modifying the connection strengths in such a way that internal units which are not part of the input or output come to represent important features of the task domain. Several interesting gradient-descent procedures have recently been discovered. Each connection computes the derivative, with respect to the connection strength, of a global measure of the error in the performance of the network. The strength is then adjusted in the direction that decreases the error. These relatively simple, gradient-descent learning procedures work well for small tasks and the new challenge is to find ways of improving their convergence rate and their generalization abilities so that they can be applied to larger, more realistic tasks.
Efficient BackProp, 1998
Cited by 127 (25 self)
The convergence of backpropagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations. 1 Introduction Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science. Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number ...
Stochastic gradient learning in neural networks
In Proceedings of Neuro-Nîmes, EC2, 1991
Cited by 28 (1 self)
Many connectionist learning algorithms consist of minimizing a cost of the form C(w) = E(J(z, w)) = ∫ J(z, w) dP(z), where dP is an unknown probability distribution that characterizes the problem to learn, and J, the loss function, defines the learning system itself. This popular statistical formulation has led to many theoretical results. The minimization of such a cost may be achieved with a stochastic gradient descent algorithm, e.g.: w_{t+1} = w_t − ε_t ∇_w J(z, w_t). With some restrictions on J and C, this algorithm converges, even if J is nondifferentiable on a set of measure zero. Links with simulated annealing are depicted.
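The stochastic update w_{t+1} = w_t − ε_t ∇_w J(z, w_t) is easy to sketch concretely. Only the update rule itself comes from the paper; the quadratic toy loss, the step-size schedule ε_t = 1/(1 + t) (which satisfies the usual Robbins–Monro conditions Σε_t = ∞, Σε_t² < ∞), and all names below are illustrative assumptions.

```python
import random

def sgd(grad_J, w0, samples, epochs=20, seed=0):
    """Minimise C(w) = E[J(z, w)] with the stochastic update
    w_{t+1} = w_t - eps_t * grad_w J(z_t, w_t), using eps_t = 1/(1+t)."""
    rng = random.Random(seed)
    w, t = w0, 0
    for _ in range(epochs):
        batch = samples[:]          # draw samples in a fresh random order
        rng.shuffle(batch)
        for z in batch:
            w -= (1.0 / (1 + t)) * grad_J(z, w)
            t += 1
    return w

# Toy loss J(z, w) = (w - z)^2 / 2, so grad_w J = w - z; the minimiser of
# C(w) is the mean of the data. With this schedule the iterate is exactly
# the running average of the samples seen so far.
data = [1.0, 2.0, 3.0, 4.0]
w_star = sgd(lambda z, w: w - z, 0.0, data)  # -> 2.5 (the data mean)
```

The choice of J is what distinguishes one learning system from another; the minimization machinery above stays the same.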
Automatic Learning Rate Maximization by On-Line Estimation of the Hessian's Eigenvectors
Advances in Neural Information Processing Systems, 1993
Cited by 23 (2 self)
We propose a very simple, and well-principled way of computing the optimal step size in gradient descent algorithms. The online version is very efficient computationally, and is applicable to large backpropagation networks trained on large data sets. The main ingredient is a technique for estimating the principal eigenvalue(s) and eigenvector(s) of the objective function's second-derivative matrix (Hessian), which does not even require calculating the Hessian. Several other applications of this technique are proposed for speeding up learning, or for eliminating useless parameters. 1 INTRODUCTION Choosing the appropriate learning rate, or step size, in a gradient descent procedure such as backpropagation, is simultaneously one of the most crucial and expert-intensive parts of neural-network learning. We propose a method for computing the best step size which is both well-principled, simple, very cheap computationally, and, most of all, applicable to online training with large ne...
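The key trick — a Hessian-vector product obtained from two gradient evaluations, H v ~ (grad C(w + a*v) − grad C(w)) / a, fed into power iteration — can be sketched as follows. The toy quadratic and all names are illustrative assumptions; this is not the authors' implementation.

```python
import math

def principal_eig(grad, w, dim, alpha=1e-4, iters=100):
    """Estimate the principal eigenvalue/eigenvector of the Hessian of a
    cost C at w by power iteration, using the finite-difference
    Hessian-vector product  H v ~ (grad(w + a*v) - grad(w)) / a
    so that the Hessian matrix itself is never formed."""
    v = [1.0] * dim
    g0 = grad(w)
    lam = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]                      # normalize v
        wp = [wi + alpha * vi for wi, vi in zip(w, v)]  # perturbed point
        hv = [(gp - g) / alpha for gp, g in zip(grad(wp), g0)]  # H v
        lam = sum(vi * hi for vi, hi in zip(v, hv))    # Rayleigh quotient
        v = hv
    norm = math.sqrt(sum(x * x for x in v))
    return lam, [x / norm for x in v]

# Toy quadratic C(w) = (3*w0^2 + w1^2) / 2, so grad C = (3*w0, w1) and
# the Hessian is diag(3, 1); the leading eigenvalue is 3.
lam, v = principal_eig(lambda w: [3 * w[0], w[1]], [0.0, 0.0], dim=2)
```

The estimated leading eigenvalue bounds the largest safe learning rate for gradient descent (for a quadratic, divergence sets in beyond 2/λ_max), which is how such an estimate turns into a step-size rule.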
Links between Perceptrons, MLPs and SVMs
In Proceedings of ICML, 2004
Cited by 22 (4 self)
We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin.
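For context, the classic Perceptron learning rule that these comparisons start from can be sketched in a few lines. The margin-maximization and regularization variants discussed in the paper are not shown here; the data and names below are illustrative assumptions.

```python
def perceptron(samples, dim, lr=1.0, epochs=100):
    """Classic Perceptron training sketch: for each sample (x, y) with
    y in {-1, +1}, update w <- w + lr * y * x whenever the sample is
    misclassified, i.e. whenever y * <w, x> <= 0."""
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:   # data linearly separated: stop early
            break
    return w

# A linearly separable toy set in 2D, with a constant bias feature
# appended as a third coordinate.
data = [([1.0, 2.0, 1.0], +1), ([2.0, 1.0, 1.0], +1),
        ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
w = perceptron(data, dim=3)
```

Note that this rule stops at *any* separating hyperplane; controlling *which* one (i.e. the margin) is exactly what the regularization and early-stopping mechanisms studied in the paper add.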
Modular Neural Networks and Self-Decomposition, 1997
Cited by 12 (6 self)
Embedding modularity (i.e., performing a local and encapsulated computation) in neural networks (NN) leads to many advantages. Hence, the development of a general model of modular neural networks (MNN) will enable a broader use of neural networks. However, some important issues remain to be solved to enable a systematic use of MNN. From a practical point of view, the most important matter concerns the decomposition of the task into subtasks. We have introduced here the concept of vertical and horizontal decomposition in order to classify the existing modular models capable of performing a self-decomposition. The modular models available for a horizontal self-decomposition (i.e., a clustering of the input space) are mainly the Local Model Network (LMN) and the algorithm of Jacobs and Jordan. Those two algorithms appear complementary. The convergence of the latter is not ensured, but the criterion it uses for decomposing the input space is far more ambitious and efficient than the s...
Modular Neural Networks: a state of the art, 1995
Cited by 12 (3 self)
"Global neural networks" (such as the backpropagation neural network) and "clustering neural networks" (such as the radial basis function neural network) each offer different advantages and drawbacks. The combination of the desirable features of those two neural ways of computation is achieved by the use of Modular Neural Networks (MNN). In addition, a considerable advantage can emerge from the use of such an MNN: an interpretable and relevant neural representation of the plant's behaviour. This very desirable feature, for function approximation and especially for control problems, is what other neural models lack. This feature is so important that we introduce it as a way to differentiate MNN from other local computation models. However, to enable a systematic use of MNN, three steps have to be achieved. First of all, the task has to be decomposed into subtasks, then the neural modules have to be properly organised considering the subtasks, and finally a way of commu...
Evolutionary chromatographic law identification by recurrent neural nets
Proceedings of the 4th Annual Conference on Evolutionary Programming, 1995
Cited by 10 (10 self)
Analytic chromatography is a physical process whose aim is the separation of the components of a chemical mixture, based on their different affinities for some porous medium through which they are percolated. This paper presents an application of evolutionary recurrent neural net optimization to the identification of the internal law of chromatography. New mutation operators involving the parameters of a single neuron are introduced. Furthermore, the strategy for using the different kinds of mutation takes into account the past history of the neural net at hand. The first results for one- and two-component mixtures demonstrate the basic feasibility of the recurrent neural net approach. A strategy to improve the robustness of the results is presented.