Results 1 - 7 of 7
Fast Exact Multiplication by the Hessian
Neural Computation, 1994
Abstract

Cited by 72 (4 self)
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr)f(w + rv)_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.
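The R-operator described in the abstract amounts to forward-mode differentiation of the gradient along the direction v. A minimal sketch of that idea, using a hand-coded dual-number class and a toy two-weight error function (both illustrative stand-ins, not the paper's networks):

```python
# Sketch of the Hv trick: Hv = (d/dr) grad E(w + r v) at r = 0.
# Forward-mode differentiation (dual numbers) applied to a hand-coded
# gradient yields the exact Hessian-vector product, no finite differences.

class Dual:
    """Number a + b*eps with eps^2 = 0; b carries the directional derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def grad_E(w):
    # Gradient of the toy error E(w) = w0^2 * w1 + w1^3 (hand-derived;
    # the same code runs on plain floats or on Duals).
    w0, w1 = w
    return [2 * w0 * w1, w0 * w0 + 3 * w1 * w1]

def hessian_vector_product(w, v):
    # Evaluate grad_E at w + r*v with r as the dual part; the dual
    # components of the result are exactly H v.
    wd = [Dual(wi, vi) for wi, vi in zip(w, v)]
    return [g.b for g in grad_E(wd)]

# For E above, H = [[2*w1, 2*w0], [2*w0, 6*w1]]; at w = (1, 2), Hv follows:
print(hessian_vector_product([1.0, 2.0], [1.0, 0.0]))  # → [4.0, 2.0]
```

The cost is one extra pass through the gradient code, matching the abstract's claim that computing Hv is about as expensive as a gradient evaluation.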
Computing Second Derivatives in Feed-Forward Networks: A Review
IEEE Transactions on Neural Networks, 1994
Abstract

Cited by 28 (4 self)
The calculation of second derivatives is required by recent training and analysis techniques for connectionist networks, such as the elimination of superfluous weights and the estimation of confidence intervals both for weights and network outputs. We here review and develop exact and approximate algorithms for calculating second derivatives. For networks with |w| weights, simply writing the full matrix of second derivatives requires O(|w|^2) operations. For networks of radial basis units or sigmoid units, exact calculation of the necessary intermediate terms requires on the order of 2h + 2 backward/forward propagation passes, where h is the number of hidden units in the network. We also review and compare three approximations (ignoring some components of the second derivative, numerical differentiation, and scoring). Our algorithms apply to arbitrary activation functions, networks, and error functions (for instance, with connections that skip layers, or radial basis functions, or ...
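Of the three approximations listed, numerical differentiation is easy to sketch: each column of the Hessian is estimated by a central difference of the gradient. The gradient function below is a hypothetical stand-in for a network's backpropagated gradient:

```python
# Numerical differentiation of second derivatives: estimate column j of
# the Hessian from gradients at w + eps*e_j and w - eps*e_j.
# grad_E and its toy error function are illustrative stand-ins.

def grad_E(w):
    # Gradient of E(w) = w0^2 * w1 + w1^3 (hand-derived toy example).
    w0, w1 = w
    return [2 * w0 * w1, w0 * w0 + 3 * w1 * w1]

def hessian_fd(grad, w, eps=1e-5):
    n = len(w)
    H = [[0.0] * n for _ in range(n)]
    for j in range(n):
        wp = list(w); wp[j] += eps
        wm = list(w); wm[j] -= eps
        gp, gm = grad(wp), grad(wm)
        for i in range(n):
            H[i][j] = (gp[i] - gm[i]) / (2 * eps)  # ~ d^2 E / dw_i dw_j
    return H

H = hessian_fd(grad_E, [1.0, 2.0])
print(H)  # close to the exact Hessian [[4, 2], [2, 12]]
```

Note the cost: n extra gradient evaluations per Hessian, and accuracy limited by the choice of eps, which is why the review also develops exact algorithms.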
New millennium AI and the convergence of history
Challenges to Computational Intelligence, 2007
Abstract

Cited by 6 (4 self)
Artificial Intelligence (AI) has recently become a real formal science: the new millennium brought the first mathematically sound, asymptotically optimal, universal problem solvers, providing a new, rigorous foundation for the previously largely heuristic field of General AI and embedded agents. At the same time there has been rapid progress in practical methods for learning true sequence-processing programs, as opposed to traditional methods limited to stationary pattern association. Here we will briefly review some of the new results, and speculate about future developments, pointing out that the time intervals between the most notable events in over 40,000 years or 2^9 lifetimes of human history have sped up exponentially, apparently converging to zero within the next few decades. Or is this impression just a byproduct of the way humans allocate memory space to past events?
Unified Formulation for Training Recurrent Networks with Derivative Adaptive Critics
In Proc. of the IEEE International Conference on Neural Networks (ICNN), 1997
Learning in Networks
1995
Abstract

Cited by 1 (0 self)
Intelligent systems require software incorporating probabilistic reasoning, and oftentimes learning. Networks provide a framework and methodology for creating this kind of software. This paper introduces network models based on chain graphs with deterministic nodes. Chain graphs are defined as a hierarchical combination of Bayesian and Markov networks. To model learning, plates on chain graphs are introduced to model independent samples. The paper concludes by discussing various operations that can be performed on chain graphs with plates as a simplification process or to generate learning algorithms.
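As a rough illustration of how plates model independent samples, here is a hypothetical sketch that unrolls a plate into per-sample node copies; the representation and names are ours, not the paper's notation:

```python
# A plate replicates the nodes inside it once per independent sample.
# Edges are (parent, child) pairs; a node inside the plate is indexed per
# sample, a node outside the plate is shared across all copies.

def expand_plate(edges, plated, n):
    idx = lambda v, i: f"{v}[{i}]" if v in plated else v
    out = []
    for a, b in edges:
        if a in plated or b in plated:
            # Edge touches the plate: one copy per sample.
            out.extend((idx(a, i), idx(b, i)) for i in range(n))
        else:
            out.append((a, b))  # entirely outside the plate: unchanged
    return out

# Learning setup: a parameter node theta outside the plate feeds each of
# n independent observations x inside the plate.
print(expand_plate([("theta", "x")], {"x"}, 3))
# → [('theta', 'x[0]'), ('theta', 'x[1]'), ('theta', 'x[2]')]
```

The shared theta node is exactly what makes the unrolled samples conditionally independent given the parameters, which is the role plates play in the learning algorithms the paper discusses.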
DRAFT: Invited paper for 50th Session of the International Statistical Institute, Beijing, China, August 1995. LEARNING IN NETWORKS
1995
Abstract
Intelligent systems require software incorporating probabilistic reasoning, and oftentimes learning. Networks provide a framework and methodology for creating this kind of software. This paper introduces network models based on chain graphs with deterministic nodes. Chain graphs are defined as a hierarchical combination of Bayesian and Markov networks. To model learning, plates on chain graphs are introduced to model independent samples. The paper concludes by discussing various operations that can be performed on chain graphs with plates as a simplification process or to generate learning algorithms.
Determination of the Regularization Parameter for Support Vector Machines via Vasconcelos' Genetic Algorithm
Abstract
This paper presents a genetic algorithm (GA) methodology for training a support vector machine (SVM). The SVM method may be viewed as a quadratic optimization problem with linear constraints, where the objective is to minimize the error learning rate and the Vapnik-Chervonenkis (VC) dimension in order to get an Optimal Separating Hyperplane (OSH) that classifies two sets of data. An SVM is a very good tool for classification problems and displays an excellent generalization ability. In order to test our method we solve the XOR problem, a canonical nonlinearly separable problem. We used a genetic algorithm (GA) called Vasconcelos' GA (VGA). The genome was selected to solve the dual SVM problem, where each individual corresponds to a Lagrange multiplier. Our interest lay in finding the “best” value of C (the so-called “regularization” parameter); C reflects a tradeoff between the performance of the trained SVM and its allowed level of misclassification. We solved the problem in two ways: (a) we provided C, as is traditional in the normal treatment of the problem; (b) we implemented a complementary approach, wherein C is also included in the genome. In case (b) VGA finds C's value, freeing the user from having to find it by heuristics. We report an exact solution for case (a) and, importantly, encouraging results in which the error in the solution for case (b) is practically zero.
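A minimal sketch in the spirit of case (b): the regularization parameter C is part of the genome and evolved directly. The fitness function below is a hypothetical stand-in for the SVM's validation error (in the paper it would come from training the dual SVM on the XOR data), and the plain truncation-selection GA here stands in for Vasconcelos' GA:

```python
# Evolving C with a generic GA: real-valued genome, truncation selection,
# arithmetic crossover, Gaussian mutation. The fitness is a stand-in; a
# faithful version would train the dual SVM for each candidate C and
# return its misclassification rate, and would use VGA's selection scheme.
import random

def fitness(C):
    # Hypothetical stand-in: pretend validation error is minimized at C = 10.
    return (C - 10.0) ** 2

def evolve(pop_size=30, generations=60, lo=0.0, hi=100.0, seed=0):
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]  # random genomes
    for _ in range(generations):
        pop.sort(key=fitness)                 # rank by (stand-in) fitness
        parents = pop[:pop_size // 2]         # keep best half (elitism)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2.0             # arithmetic crossover
            child += rng.gauss(0.0, 1.0)      # Gaussian mutation
            children.append(min(max(child, lo), hi))  # clip to bounds
        pop = parents + children
    return min(pop, key=fitness)

best_C = evolve()
print(best_C)  # converges near the stand-in optimum C = 10
```

Because the best half of the population survives unchanged each generation, the best candidate C never gets worse; the tradeoff, as with any GA approach, is many fitness evaluations, each of which would require a full SVM training run in the real setting.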