Results 1  10
of
26
When Networks Disagree: Ensemble Methods for Hybrid Neural Networks
, 1993
"... This paper presents a general theoretical framework for ensemble methods of constructing significantly improved regression estimates. Given a population of regression estimators, we construct a hybrid estimator which is as good or better in the MSE sense than any estimator in the population. We argu ..."
Abstract

Cited by 290 (2 self)
 Add to MetaCart
This paper presents a general theoretical framework for ensemble methods of constructing significantly improved regression estimates. Given a population of regression estimators, we construct a hybrid estimator which is as good or better in the MSE sense than any estimator in the population. We argue that the ensemble method presented has several properties: 1) It efficiently uses all the networks of a population  none of the networks need be discarded. 2) It efficiently uses all the available data for training without overfitting. 3) It inherently performs regularization by smoothing in functional space which helps to avoid overfitting. 4) It utilizes local minima to construct improved estimates whereas other neural network algorithms are hindered by local minima. 5) It is ideally suited for parallel computation. 6) It leads to a very useful and natural measure of the number of distinct estimators in a population. 7) The optimal parameters of the ensemble estimator are given in clo...
Gaussian Processes for Regression
 Advances in Neural Information Processing Systems 8
, 1996
"... The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparame ..."
Abstract

Cited by 222 (19 self)
 Add to MetaCart
The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
"... Supervised neural networks generalize well if there is much less information in the weights than there is in the output vectors of the training cases. So during learning, it is important to keep the weights simple by penalizing the amount of information they contain. The amount of information in a w ..."
Abstract

Cited by 128 (2 self)
 Add to MetaCart
Supervised neural networks generalize well if there is much less information in the weights than there is in the output vectors of the training cases. So during learning, it is important to keep the weights simple by penalizing the amount of information they contain. The amount of information in a weight can be controlled by adding Gaussian noise and the noise level can be adapted during learning to optimize the tradeoff between the expected squared error of the network and the amount of information in the weights. We describe a method of computing the derivatives of the expected squared error and of the amount of information in the noisy weights in a network that contains a layer of nonlinear hidden units. Provided the output units are linear, the exact derivatives can be computed efficiently without timeconsuming Monte Carlo simulations. The idea of minimizing the amount of information that is required to communicate the weights of a neural network leads to a number of interesting schemes for encoding the weights.
Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization
, 1993
"... ..."
Comparison of Approximate Methods for Handling Hyperparameters
 NEURAL COMPUTATION
"... I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models which include unknown hyperparameters such as regularization constants and noise levels. In the 'evidence framework' the model parameters are integrated over, and the resulting evid ..."
Abstract

Cited by 68 (1 self)
 Add to MetaCart
I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models which include unknown hyperparameters such as regularization constants and noise levels. In the 'evidence framework' the model parameters are integrated over, and the resulting evidence is maximized over the hyperparameters. The optimized
Bayesian Nonlinear Modelling for the Prediction Competition
 In ASHRAE Transactions, V.100, Pt.2
, 1994
"... . The 1993 energy prediction competition involved the prediction of a series of building energy loads from a series of environmental input variables. Nonlinear regression using `neural networks' is a popular technique for such modeling tasks. Since it is not obvious how large a timewindow of inpu ..."
Abstract

Cited by 51 (4 self)
 Add to MetaCart
. The 1993 energy prediction competition involved the prediction of a series of building energy loads from a series of environmental input variables. Nonlinear regression using `neural networks' is a popular technique for such modeling tasks. Since it is not obvious how large a timewindow of inputs is appropriate, or what preprocessing of inputs is best, this can be viewed as a regression problem in which there are many possible input variables, some of which may actually be irrelevant to the prediction of the output variable. Because a finite data set will show random correlations between the irrelevant inputs and the output, any conventional neural network (even with regularisation or `weight decay') will not set the coefficients for these junk inputs to zero. Thus the irrelevant variables will hurt the model's performance. The Automatic Relevance Determination (ARD) model puts a prior over the regression parameters which embodies the concept of relevance. This is done in a simple...
Bayesian Neural Networks and Density Networks
 Nuclear Instruments and Methods in Physics Research, A
, 1994
"... This paper reviews the Bayesian approach to learning in neural networks, then introduces a new adaptive model, the density network. This is a neural network for which target outputs are provided, but the inputs are unspecied. When a probability distribution is placed on the unknown inputs, a latent ..."
Abstract

Cited by 40 (8 self)
 Add to MetaCart
This paper reviews the Bayesian approach to learning in neural networks, then introduces a new adaptive model, the density network. This is a neural network for which target outputs are provided, but the inputs are unspecied. When a probability distribution is placed on the unknown inputs, a latent variable model is dened that is capable of discovering the underlying dimensionality of a data set. A Bayesian learning algorithm for these networks is derived and demonstrated. 1 Introduction to the Bayesian view of learning A binary classier is a parameterized mapping from an input x to an output y 2 [0; 1]); when its parameters w are specied, the classier states the probability that an input x belongs to class t = 1, rather than the alternative t = 0. Consider a binary classier which models the probability as a sigmoid function of x: P (t = 1jx; w;H) = y(x; w;H) = 1 1 + e wx (1) This form of model is known to statisticians as a linear logistic model, and in the neural networks ...
Fast, frugal, and rational: How rational norms explain behavior
 ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES
, 2003
"... Much research on judgment and decision making has focussed on the adequacy of classical rationality as a description of human reasoning. But more recently it has been argued that classical rationality should also be rejected even as normative standards for human reasoning. For example, Gigerenzer an ..."
Abstract

Cited by 22 (0 self)
 Add to MetaCart
Much research on judgment and decision making has focussed on the adequacy of classical rationality as a description of human reasoning. But more recently it has been argued that classical rationality should also be rejected even as normative standards for human reasoning. For example, Gigerenzer and Goldstein (1996) and Gigerenzer and Todd (1999a) argue that reasoning involves ‘‘fast and frugal’ ’ algorithms which are not justified by rational norms, but which succeed in the environment. They provide three lines of argument for this view, based on: (A) the importance of the environment; (B) the existence of cognitive limitations; and (C) the fact that an algorithm with no apparent rational basis, TaketheBest, succeeds in an judgment task (judging which of two cities is the larger, based on lists of features of each city). We reconsider (A)–(C), arguing that standard patterns of explanation in psychology and the social and biological sciences, use rational norms to explain why simple cognitive algorithms can succeed. We also present new computer simulations that compare TaketheBest with other cognitive models (which use connectionist, exemplarbased, and decisiontree algorithms). Although TaketheBest still performs well, it does not perform noticeably better than the other models. We conclude that these results provide no strong reason to prefer TaketheBest over alternative cognitive models.
Regression with Gaussian Processes
 In Learning in Graphical Models. MIT
, 1995
"... The Bayesian analysis of neural networks is difficult because the prior over functions has a complex form, leading to implementations that either make approximations or use Monte Carlo integration techniques. In this paper I investigate the use of Gaussian process priors over functions, which per ..."
Abstract

Cited by 18 (0 self)
 Add to MetaCart
The Bayesian analysis of neural networks is difficult because the prior over functions has a complex form, leading to implementations that either make approximations or use Monte Carlo integration techniques. In this paper I investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis to be carried out exactly using matrix operations. The method has been tested on two challenging problems and has produced excellent results. 1 Introduction In the Bayesian approach to neural networks a prior distribution over the weights induces a prior distribution over functions. This prior is combined with a noise model, which specifies the probability of observing the targets t given function values y, to yield a posterior over functions which can then be used for predictions. For neural networks the prior over functions has a complex form which means that implementations must either make approximations [4] or use Monte Carlo approaches to evalua...
Hyperparameters: optimize, or integrate out?
 IN MAXIMUM ENTROPY AND BAYESIAN METHODS, SANTA BARBARA
, 1996
"... I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models which include unknown hyperparameters such as regularization constants. In the `evidence framework' the model parameters are integrated over, and the resulting evidence is maximized o ..."
Abstract

Cited by 18 (4 self)
 Add to MetaCart
I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models which include unknown hyperparameters such as regularization constants. In the `evidence framework' the model parameters are integrated over, and the resulting evidence is maximized over the hyperparameters. The optimized hyperparameters are used to define a Gaussian approximation to the posterior distribution. In the alternative `MAP' method, the true posterior probability is found by integrating over the hyperparameters. The true posterior is then maximized over the model parameters, and a Gaussian approximation is made. The similarities of the two approaches, and their relative merits, are discussed, and comparisons are made with the ideal hierarchical Bayesian solution. In moderately illposed problems, integration over hyperparameters yields a probability distribution with a skew peak which causes significant biases to arise in the MAP method. In contrast, the evidence framework is shown to introduce negligible predictive error, under straightforward conditions. General lessons are drawn concerning the distinctive properties of inference in many dimensions.