Results 1–10 of 31
When Networks Disagree: Ensemble Methods for Hybrid Neural Networks
, 1993
Cited by 326 (2 self)
This paper presents a general theoretical framework for ensemble methods of constructing significantly improved regression estimates. Given a population of regression estimators, we construct a hybrid estimator which is as good or better in the MSE sense than any estimator in the population. We argue that the ensemble method presented has several properties: 1) It efficiently uses all the networks of a population; none of the networks need be discarded. 2) It efficiently uses all the available data for training without overfitting. 3) It inherently performs regularization by smoothing in functional space, which helps to avoid overfitting. 4) It utilizes local minima to construct improved estimates, whereas other neural network algorithms are hindered by local minima. 5) It is ideally suited for parallel computation. 6) It leads to a very useful and natural measure of the number of distinct estimators in a population. 7) The optimal parameters of the ensemble estimator are given in clo...
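The MSE guarantee this abstract describes can be illustrated with a minimal sketch. The target function, the population size, and the per-estimator biases below are invented for illustration; each biased predictor stands in for a trained network that settled in a different local minimum:

```python
import random
import statistics

random.seed(0)

def true_f(x):
    return x * x

# Population of biased regression estimators (stand-ins for trained networks
# that each converged to a different local minimum).
biases = [random.gauss(0.0, 0.5) for _ in range(10)]
estimators = [lambda x, b=b: true_f(x) + b for b in biases]

xs = [i / 10 for i in range(100)]

def mse(pred):
    return statistics.fmean((pred(x) - true_f(x)) ** 2 for x in xs)

def ensemble(x):
    # Basic ensemble: uniform average over the whole population.
    return statistics.fmean(e(x) for e in estimators)

# By Jensen's inequality, the averaged estimator's MSE never exceeds the
# population-average MSE: (mean b)^2 <= mean(b^2).
print(mse(ensemble), statistics.fmean(mse(e) for e in estimators))
```

No network is discarded: the uniform average already beats the population average, which is the weakest form of the paper's claim.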
Gaussian Processes for Regression
 Advances in Neural Information Processing Systems 8
, 1996
Cited by 254 (21 self)
The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.
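The "exactly using matrix operations" part of this abstract amounts to solving one linear system. A tiny sketch follows; the squared-exponential kernel, its hyperparameter values, and the three training points are assumptions made for illustration, not choices from the paper:

```python
import math

# Squared-exponential kernel; length scale and signal variance are
# illustrative fixed hyperparameters.
def k(a, b, ell=1.0, sf2=1.0):
    return sf2 * math.exp(-0.5 * ((a - b) / ell) ** 2)

X = [-1.0, 0.0, 1.0]
y = [math.sin(x) for x in X]
noise = 1e-6  # small assumed noise variance

# Gram matrix K + sigma_n^2 I.
K = [[k(a, b) + (noise if i == j else 0.0)
      for j, b in enumerate(X)] for i, a in enumerate(X)]

def solve(A, b):
    # Gaussian elimination with partial pivoting; fine for a 3x3 system.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

alpha = solve(K, y)  # alpha = (K + sigma_n^2 I)^{-1} y

def predict(xstar):
    # Predictive mean: k_*^T (K + sigma_n^2 I)^{-1} y.
    return sum(k(xstar, xi) * a for xi, a in zip(X, alpha))

print(predict(0.0), predict(0.5))
```

With near-zero noise the predictive mean interpolates the training targets, which is the fixed-hyperparameter case the abstract says can be handled exactly; optimizing or averaging over the hyperparameters is the paper's contribution on top of this.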
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Cited by 145 (2 self)
Supervised neural networks generalize well if there is much less information in the weights than there is in the output vectors of the training cases. So during learning, it is important to keep the weights simple by penalizing the amount of information they contain. The amount of information in a weight can be controlled by adding Gaussian noise and the noise level can be adapted during learning to optimize the tradeoff between the expected squared error of the network and the amount of information in the weights. We describe a method of computing the derivatives of the expected squared error and of the amount of information in the noisy weights in a network that contains a layer of nonlinear hidden units. Provided the output units are linear, the exact derivatives can be computed efficiently without time-consuming Monte Carlo simulations. The idea of minimizing the amount of information that is required to communicate the weights of a neural network leads to a number of interesting schemes for encoding the weights.
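The "exact derivatives without Monte Carlo" claim rests on the fact that, for a linear output, the expectation over the weight noise has a closed form. A one-weight sketch (the weight mean, noise level, input, and target values are invented) checks the closed form against a Monte Carlo estimate:

```python
import random
import statistics

# A single linear unit with a noisy weight w ~ N(mean, std^2).
# For a linear output, E[(w*x - t)^2] = (mean*x - t)^2 + std^2 * x^2,
# so the expected squared error (and its derivatives) need no sampling.
def expected_sq_err(mean, std, x, t):
    return (mean * x - t) ** 2 + (std ** 2) * (x ** 2)

random.seed(3)
mean, std, x, t = 0.8, 0.2, 1.5, 1.0  # illustrative values

mc = statistics.fmean(
    ((mean + random.gauss(0, std)) * x - t) ** 2 for _ in range(200000)
)
exact = expected_sq_err(mean, std, x, t)
print(exact, mc)
```

The two numbers agree to sampling accuracy; the paper extends this idea through a layer of nonlinear hidden units.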
Comparison of Approximate Methods for Handling Hyperparameters
 NEURAL COMPUTATION
Cited by 83 (1 self)
I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models which include unknown hyperparameters such as regularization constants and noise levels. In the 'evidence framework' the model parameters are integrated over, and the resulting evidence is maximized over the hyperparameters. The optimized
Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization
, 1993
Bayesian Nonlinear Modelling for the Prediction Competition
 In ASHRAE Transactions, V.100, Pt.2
, 1994
Cited by 63 (4 self)
The 1993 energy prediction competition involved the prediction of a series of building energy loads from a series of environmental input variables. Nonlinear regression using 'neural networks' is a popular technique for such modeling tasks. Since it is not obvious how large a time window of inputs is appropriate, or what preprocessing of inputs is best, this can be viewed as a regression problem in which there are many possible input variables, some of which may actually be irrelevant to the prediction of the output variable. Because a finite data set will show random correlations between the irrelevant inputs and the output, any conventional neural network (even with regularisation or 'weight decay') will not set the coefficients for these junk inputs to zero. Thus the irrelevant variables will hurt the model's performance. The Automatic Relevance Determination (ARD) model puts a prior over the regression parameters which embodies the concept of relevance. This is done in a simple...
Bayesian Neural Networks and Density Networks
 Nuclear Instruments and Methods in Physics Research, A
, 1994
Cited by 46 (7 self)
This paper reviews the Bayesian approach to learning in neural networks, then introduces a new adaptive model, the density network. This is a neural network for which target outputs are provided, but the inputs are unspecified. When a probability distribution is placed on the unknown inputs, a latent variable model is defined that is capable of discovering the underlying dimensionality of a data set. A Bayesian learning algorithm for these networks is derived and demonstrated.

1 Introduction to the Bayesian view of learning

A binary classifier is a parameterized mapping from an input x to an output y ∈ [0, 1]; when its parameters w are specified, the classifier states the probability that an input x belongs to class t = 1, rather than the alternative t = 0. Consider a binary classifier which models the probability as a sigmoid function of x:

P(t = 1 | x, w, H) = y(x; w, H) = 1 / (1 + e^{-w·x})    (1)

This form of model is known to statisticians as a linear logistic model, and in the neural networks ...
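The linear logistic model of equation (1) is a few lines of code. The weight vector below is an arbitrary illustrative choice, not a fitted value:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Linear logistic classifier: P(t = 1 | x, w) = sigmoid(w . x), as in eq. (1).
def p_class1(x, w):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

w = [2.0, -1.0]  # illustrative weights, not values from the paper
print(p_class1([0.0, 0.0], w))  # 0.5 exactly at the decision boundary w.x = 0
```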
Fast, frugal, and rational: How rational norms explain behavior
 ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES
, 2003
Cited by 30 (1 self)
Much research on judgment and decision making has focussed on the adequacy of classical rationality as a description of human reasoning. But more recently it has been argued that classical rationality should also be rejected even as a normative standard for human reasoning. For example, Gigerenzer and Goldstein (1996) and Gigerenzer and Todd (1999a) argue that reasoning involves "fast and frugal" algorithms which are not justified by rational norms, but which succeed in the environment. They provide three lines of argument for this view, based on: (A) the importance of the environment; (B) the existence of cognitive limitations; and (C) the fact that an algorithm with no apparent rational basis, Take-the-Best, succeeds in a judgment task (judging which of two cities is the larger, based on lists of features of each city). We reconsider (A)–(C), arguing that standard patterns of explanation in psychology and the social and biological sciences use rational norms to explain why simple cognitive algorithms can succeed. We also present new computer simulations that compare Take-the-Best with other cognitive models (which use connectionist, exemplar-based, and decision-tree algorithms). Although Take-the-Best still performs well, it does not perform noticeably better than the other models. We conclude that these results provide no strong reason to prefer Take-the-Best over alternative cognitive models.
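The Take-the-Best heuristic compared in these simulations is simple to state: inspect binary cues in descending order of validity and decide on the first cue that discriminates between the two options. A minimal sketch, with a made-up cue ordering and made-up cue values for two cities:

```python
# Take-the-Best: examine cues in descending validity; the first cue that
# discriminates decides. Cue order and values here are illustrative only.
cues_by_validity = ["capital", "exposition_site", "soccer_team"]

cities = {
    "Berlin":  {"capital": 1, "exposition_site": 0, "soccer_team": 1},
    "Hamburg": {"capital": 0, "exposition_site": 1, "soccer_team": 1},
}

def take_the_best(a, b):
    for cue in cues_by_validity:
        va, vb = cities[a][cue], cities[b][cue]
        if va != vb:
            return a if va > vb else b  # first discriminating cue decides
    return None                         # no cue discriminates: guess

print(take_the_best("Berlin", "Hamburg"))
```

Note the heuristic is "frugal" in that it ignores every cue after the first discriminating one; the paper's simulations pit exactly this stopping rule against models that integrate all cues.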
Linearly combining density estimators via stacking
 Machine Learning
, 1999
Cited by 25 (4 self)
This paper presents experimental results with both real and artificial data on using the technique of stacking to combine unsupervised learning algorithms. Specifically, stacking is used to form a linear combination of finite mixture model and kernel density estimators for nonparametric multivariate density estimation. The method is found to outperform other strategies such as choosing the single best model based on cross-validation, combining with uniform weights, and even using the single best model chosen by "cheating" and examining the test set. We also investigate in detail how the utility of stacking changes when one of the models being combined generated the data; how the stacking coefficients of the models compare to the relative frequencies with which cross-validation chooses among the models; visualization of combined "effective" kernels; and the sensitivity of stacking to overfitting as model complexity increases. In an extended version of this paper we also investigate how stacking performs using L1 and L2 performance measures (for which one must know the true density) rather than log-likelihood (Smyth and Wolpert 1998).
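The core idea, a convex combination of density estimators weighted by held-out log-likelihood, fits in a short sketch. The single-Gaussian parametric model, the kernel bandwidth, the data split, and the grid search over one weight are all simplifying assumptions; the paper fits full mixture models and learns the coefficients more carefully:

```python
import math
import random
import statistics

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
train, held = data[:150], data[150:]

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Estimator 1: a single fitted Gaussian (stand-in for a finite mixture model).
mu, sigma = statistics.fmean(train), statistics.stdev(train)
def parametric(x):
    return gauss_pdf(x, mu, sigma)

# Estimator 2: a Gaussian kernel density estimate (bandwidth is an assumption).
h = 0.4
def kde(x):
    return statistics.fmean(gauss_pdf(x, xi, h) for xi in train)

# Stack: pick the convex weight that maximises held-out log-likelihood.
def held_ll(w):
    return sum(math.log(w * parametric(x) + (1 - w) * kde(x)) for x in held)

best_w = max((i / 20 for i in range(21)), key=held_ll)
print(best_w)
```

Because the grid includes the endpoints 0 and 1, the stacked combination can never score worse on the held-out set than either component alone, which is the intuition behind stacking beating single-model selection.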
Bayesian Regularisation and Pruning using a Laplace Prior
 Neural Computation
, 1994
Cited by 24 (0 self)
Standard techniques for improved generalisation from neural networks include weight decay and pruning. Weight decay has a Bayesian interpretation with the decay function corresponding to a prior over weights. The method of transformation groups and maximum entropy indicates a Laplace rather than a Gaussian prior. After training, the weights then arrange themselves into two classes: (1) those with a common sensitivity to the data error, and (2) those failing to achieve this sensitivity and which therefore vanish. Since the critical value is determined adaptively during training, pruning, in the sense of setting weights to exact zeros, becomes a consequence of regularisation alone. The count of free parameters is also reduced automatically as weights are pruned. A comparison is made with results of MacKay using the evidence framework and a Gaussian regulariser.

1 Introduction

Neural networks designed for regression or classification need to be trained using some form of stabilisation or re...
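The "exact zeros from regularisation alone" behaviour can be demonstrated on a linear model. The sketch below uses proximal gradient steps with a soft-threshold operator, which is the standard way an L1 (Laplace-prior) penalty produces exact zeros; the data, learning rate, and penalty strength are invented, and the paper's actual training procedure for networks differs:

```python
import random

random.seed(2)
# Data: y depends only on x0; x1 is a junk input.
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(100)]
y = [2.0 * x0 + random.gauss(0, 0.1) for x0, _ in X]

w = [0.0, 0.0]
lr, lam = 0.05, 0.5  # step size and Laplace-prior strength (assumptions)

def soft_threshold(v, t):
    # Proximal map of the L1 penalty: values inside [-t, t] go to exact zero.
    return max(abs(v) - t, 0.0) * (1 if v > 0 else -1)

for _ in range(200):
    # Gradient step on the mean squared error...
    g = [0.0, 0.0]
    for (x0, x1), t in zip(X, y):
        err = w[0] * x0 + w[1] * x1 - t
        g[0] += 2 * err * x0 / len(X)
        g[1] += 2 * err * x1 / len(X)
    # ...then the Laplace-prior threshold: weights whose data-error
    # sensitivity stays below it are pruned to exactly zero.
    w = [soft_threshold(wi - lr * gi, lr * lam) for wi, gi in zip(w, g)]

print(w)  # the junk weight w[1] ends at exactly 0.0
```

The relevant weight survives (shrunk slightly by the penalty) while the junk weight is set to an exact zero, so pruning falls out of the regulariser with no separate pruning pass, which is the qualitative behaviour the abstract claims for the Laplace prior.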