Results 1–10 of 10
A Bayesian Committee Machine
 NEURAL COMPUTATION
, 2000
Abstract

Cited by 73 (7 self)
The Bayesian committee machine (BCM) is a novel approach to combining estimators which were trained on different data sets. Although the BCM can be applied to the combination of any kind of estimator, the main foci are Gaussian process regression and related systems such as regularization networks and smoothing splines, for which the degrees of freedom increase with the number of training data. Somewhat surprisingly, we find that the performance of the BCM improves if several test points are queried at the same time, and is optimal if the number of test points is at least as large as the degrees of freedom of the estimator. The BCM also provides a new solution for online learning with potential applications to data mining. We apply the BCM to systems with fixed basis functions and discuss its relationship to Gaussian process regression. Finally, we also show how the ideas behind the BCM can be applied in a non-Bayesian setting to extend the input-dependent combination of estimators.
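The combination rule the abstract describes can be sketched numerically. Below is a minimal sketch of the BCM-style precision-weighted combination of module posteriors, assuming each module returns jointly Gaussian predictions at a shared set of test points; the function name and interface are illustrative, not from the paper:

```python
import numpy as np

def bcm_combine(means, covs, prior_cov):
    """Combine M module posteriors at a common set of T test points, BCM-style.

    means:     list of (T,) posterior predictive means, one per module
    covs:      list of (T, T) posterior predictive covariances, one per module
    prior_cov: (T, T) GP prior covariance at the test points
    """
    M = len(means)
    # BCM precision: sum of module precisions minus (M-1) copies of the
    # prior precision, correcting for the prior being counted M times.
    prec = -(M - 1) * np.linalg.inv(prior_cov)
    weighted = np.zeros_like(means[0])
    for m, C in zip(means, covs):
        Ci = np.linalg.inv(C)
        prec += Ci
        weighted += Ci @ m
    cov = np.linalg.inv(prec)
    return cov @ weighted, cov
```

With a single module (M = 1) the prior correction vanishes and the combination returns that module's posterior unchanged, which is a useful sanity check on the rule.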
Computation With Infinite Neural Networks
, 1997
Abstract

Cited by 33 (1 self)
For neural networks with a wide class of weight priors, it can be shown that in the limit of an infinite number of hidden units the prior over functions tends to a Gaussian process. In this paper analytic forms are derived for the covariance function of the Gaussian processes corresponding to networks with sigmoidal and Gaussian hidden units. This allows predictions to be made efficiently using networks with an infinite number of hidden units, and shows that, somewhat paradoxically, it may be easier to carry out Bayesian prediction with infinite networks rather than finite ones.

1 Introduction

To someone training a neural network by maximizing the likelihood of a finite amount of data it makes no sense to use a network with an infinite number of hidden units; the network will "overfit" the data and so will be expected to generalize poorly. However, the idea of selecting the network size depending on the amount of training data makes little sense to a Bayesian; a model should be chosen...
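The infinite-width limit the abstract refers to can be illustrated with a small Monte Carlo experiment. This is an illustrative sketch, not the paper's analytic derivation: under an N(0, 1/H)-scaled prior on the output weights, samples of f(x) from random one-hidden-layer tanh networks keep a stable output variance as the number of hidden units H grows, consistent with the prior over functions converging to a Gaussian process.

```python
import numpy as np

def net_outputs(x, H, n_samples, rng):
    """Sample outputs f(x) of random one-hidden-layer tanh networks.

    Hidden weights and biases ~ N(0, 1); output weights ~ N(0, 1/H),
    the scaling under which the function prior tends to a GP as H grows.
    """
    outs = np.empty(n_samples)
    for i in range(n_samples):
        w = rng.normal(size=H)                    # input-to-hidden weights
        b = rng.normal(size=H)                    # hidden biases
        v = rng.normal(scale=H ** -0.5, size=H)   # hidden-to-output weights
        outs[i] = np.sum(v * np.tanh(w * x + b))
    return outs

rng = np.random.default_rng(0)
small = net_outputs(0.5, 5, 2000, rng)     # narrow network
large = net_outputs(0.5, 500, 2000, rng)   # wide network
# The sample variance of f(x) is roughly the same for both widths:
# the prior does not blow up or collapse as H increases.
```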
Adaptive Regularization in Neural Network Modeling
, 1997
Abstract

Cited by 14 (2 self)
In this paper we address the important problem of optimizing regularization parameters in neural network modeling. The suggested optimization scheme is an extended version of the recently presented algorithm [24]. The idea is to minimize an empirical estimate, such as the cross-validation estimate, of the generalization error with respect to the regularization parameters. This is done by employing a simple iterative gradient descent scheme, using virtually no additional programming overhead compared to standard training. Experiments with feedforward neural network models for time series prediction and classification tasks showed the viability and robustness of the algorithm. Moreover, we provided some simple theoretical examples in order to illustrate the potential and limitations of the proposed regularization framework.

1 Introduction

Neural networks are flexible tools for time series processing and pattern recognition. By increasing the number of hidden neurons in a 2-layer architec...
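The idea of gradient descent on a validation estimate with respect to a regularization parameter can be sketched in a few lines. This is a stand-in illustration, not the paper's algorithm: a closed-form ridge model replaces the neural network, and a finite-difference gradient replaces the analytic one; all names are mine.

```python
import numpy as np

def val_error(lam, Xtr, ytr, Xval, yval):
    # Closed-form ridge fit on the training split, squared error on validation.
    d = Xtr.shape[1]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)
    r = Xval @ w - yval
    return float(r @ r) / len(yval)

def tune_lambda(Xtr, ytr, Xval, yval, lam=1.0, lr=0.1, steps=50, eps=1e-4):
    # Gradient descent on log(lambda) so lambda stays positive,
    # using a central finite-difference estimate of the gradient.
    log_lam = np.log(lam)
    for _ in range(steps):
        g = (val_error(np.exp(log_lam + eps), Xtr, ytr, Xval, yval)
             - val_error(np.exp(log_lam - eps), Xtr, ytr, Xval, yval)) / (2 * eps)
        log_lam -= lr * g
    return float(np.exp(log_lam))
```

Descending in log(lambda) is a common trick for keeping a positive hyperparameter positive without explicit constraints.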
Bias of Estimators and Regularization Terms
 in Proceedings of the 1998 Workshop on Information-Based Induction Sciences (IBIS'98), Izu
, 1998
Abstract

Cited by 4 (0 self)
In this paper, the role of regularization terms (penalty terms) is discussed from the viewpoint of minimizing the generalization error. First, the bias of minimum-training-error estimation is clarified. The bias is caused by the nonlinearity of the learning system and depends on the number of training examples. Then an appropriate size of the regularization term is considered by taking account of the balance between the bias and the variance of the estimator, so that the generalization error is minimized. In this framework, the optimal size of the regularization term is calculated with the second- and third-order derivatives of the loss function. When the learning system has a large number of modifiable parameters, it is computationally expensive to calculate the higher-order derivatives, thus we propose a simple method of approximating the optimal size via a generalized AIC.

1 Introduction

In order to avoid the "overfitting" problem of learning machines, such as neural networks, regularizatio...
Accuracy versus interpretability in flexible modeling: implementing a tradeoff using Gaussian process models
 BEHAVIOURMETRIKA SPECIAL ISSUE ON "INTERPRETING NEURAL NETWORK MODELS"
, 1999
Abstract

Cited by 3 (0 self)
One of the widely acknowledged drawbacks of flexible statistical models is that they are often extremely difficult to interpret. However, if flexible models are constrained to be additive they are much easier to interpret, as each input can be considered independently. The problem with additive models is that they cannot provide an accurate model if the phenomenon being modeled is not additive. This paper proposes that a tradeoff between accuracy and additivity can be implemented easily in a particular type of flexible model: a Gaussian process model. One can build a series of Gaussian process models which begin completely flexible and are constrained to be more and more additive, and thus more interpretable. Observations of how the test error and the importance of interactions change as the model becomes more additive give insight into the importance and nature of the interactions. Models in the series can also be interpreted graphically with a technique for visualizing the effects of inp...
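One way to realize the tradeoff the abstract describes is to interpolate between a fully additive kernel (a sum of one-input kernels) and a full-interaction kernel. This is a minimal sketch of that idea; the mixing parameter `alpha` and the specific RBF parameterization are my assumptions, not necessarily the paper's:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel between row sets a (n,d) and b (m,d).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def mixed_kernel(X, Z, alpha):
    # alpha = 0: fully additive model (sum of per-input kernels,
    # interpretable input by input); alpha = 1: full-interaction model.
    additive = sum(rbf(X[:, [d]], Z[:, [d]]) for d in range(X.shape[1]))
    full = rbf(X, Z)
    return (1 - alpha) * additive + alpha * full
```

Sweeping `alpha` from 1 toward 0 then traces out the series of increasingly additive (and increasingly interpretable) GP models the abstract mentions.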
A Simple Trick for Estimating the Weight Decay Parameter
 Neural Networks: Tricks of the Trade
, 1998
Abstract

Cited by 2 (0 self)
We present a simple trick to get an approximate estimate of the weight decay parameter λ. The method combines early stopping and weight decay into the estimate λ = ||∇E(W_es)|| / ||2 W_es||, where W_es is the set of weights at the early-stopping point and E(W) is the training-data fit error. The estimate is demonstrated and compared to the standard cross-validation procedure for λ selection on one synthetic and four real-life data sets. The result is that λ is as good an estimator for the optimal weight decay parameter value as the standard search estimate, but orders of magnitude quicker to compute. The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.

1 Introduction

A regression problem which does not put constraints on the model used is ill-posed [21], because there are infinitely many functions that can fit a finite set of training data perfectly. Furthermore, real life data sets tend to h...
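The estimate itself is a one-liner. The sketch below just evaluates the formula from the abstract; how the gradient and early-stopping weights are obtained is left to the training loop:

```python
import numpy as np

def weight_decay_estimate(grad_E, w_es):
    """lambda = ||grad E(W_es)|| / ||2 W_es||, the estimate from the abstract.

    grad_E: gradient of the training error at the early-stopping weights
    w_es:   the weight vector at the early-stopping point
    """
    return float(np.linalg.norm(grad_E) / np.linalg.norm(2.0 * w_es))
```

The intuition: at a minimum of the penalized error E(W) + λ||W||², the gradients balance, ∇E(W) = −2λW, so matching the norms of the two sides at the early-stopping point yields an estimate of λ.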
RBF's, SBF's, Tree-BF's, Smoothing Spline ANOVA: Representers and pseudo-representers for a dictionary of basis functions for penalized likelihood estimates
, 1996
Abstract

Cited by 1 (0 self)
This work in progress represents an attempt to combine radial basis functions (RBF's), sigmoidal basis functions (SBF's) and basis functions that may be useful in conjunction with tree-structured methods (Tree-BF's) under a single 'umbrella' of a reproducing kernel Hilbert space. Once this is done, several ways of generating a 'list' of basis functions in which to solve a penalized likelihood problem suggest themselves. Support vector methods may be used to refine the list. Given such a list, regularized forward selection methods generalizing those suggested by Orr and by Luo and Wahba may be used to fit the model. Large to very large data sets are assumed (n > 1000). It is envisioned that the approach could prove useful in building models where more than three or four but less than, say, ten or fifteen predictor variables are involved, and that the umbrella provides some intuition concerning how the basis functions are related and what they are doing, so as to give some interpretability...
Direct Zero-Norm Minimization for Neural Network Pruning and Training
Abstract
Designing a feedforward neural network with optimal topology in terms of complexity (hidden-layer nodes and connections between nodes) and training performance has been a matter of considerable concern since the very beginning of neural networks research. Typically, this issue is dealt with by pruning a fully interconnected network with “many” nodes in the hidden layers, eliminating “superfluous” connections and nodes. However, the problem has not been solved yet, and it seems to be even more relevant today in the context of deep learning networks. In this paper we present a method of direct zero-norm minimization for pruning while training a Multi-Layer Perceptron. The method employs a cooperative scheme using two swarms of particles, and its purpose is to minimize an aggregate function corresponding to the total risk functional. Our discussion highlights relevant computational and methodological issues of the approach that are not apparent and well defined in the literature.
DIMENSIONALITY REDUCTION AND FEATURE SELECTION USING A MIXED-NORM PENALTY FUNCTION
, 2005
Abstract
Dimensionality reduction, the process of mapping high-dimensional patterns to lower-dimensional subspaces, is a key issue in enhancing the processing efficiency of high-dimensional data such as hyperspectral images. Dimensionality reduction has been widely discussed in the areas of data mining, image processing, pattern recognition, etc. Because in most situations many of the dimensions are redundant or unnecessary for the tasks of interest, removing those dimensions produces more efficient computation while maintaining the original performance. Dimensionality reduction also reduces the measurement and storage requirements, reduces training and utilization times, and defies the curse of dimensionality to improve classification performance. Feature selection, the process of constructing and selecting the subsets of features that are useful to build a good predictor, has been of interest for many years. Before Kohavi and John published a special issue on feature selection in 1997, typically no more than 40 features were studied. Ever since then, people have started looking at problems with hundreds to tens of thousands of features. Like dimensionality reduction, feature selection reduces the measurement and storage requirements, reduces training and utilization times, and it facilitates...
3E381 Neural Computing Lecture 5. Network construction algorithms
Abstract
When we are given a set of data with which to train a neural network, the size of the input vectors tells us how many inputs our network will require. As regards outputs, we may have some freedom in deciding how many, but our choices will be limited. For instance, in a two-class classification problem we can choose between a single output (where +1 means one class and −1 means the other) or we might choose to have two outputs, where a high output indicates the class. After deciding on the number of outputs, there remain the issues of deciding how many hidden units and how many layers to use. As we have already seen, there is currently no simple method of deciding these issues, and laborious cross-validation (i.e., training a network on one data set and checking its behaviour on another data set) is usually employed to determine the best-performing network out of several proposed networks. This is not very satisfactory, and in this lecture we will examine an approach that allows the data to determine network structure in rather a different way. There are two basic approaches. (i) Start with a network that you are sure is big enough to accommodate the problem, train it, and then identify neural elements and connections that can be removed because they are contributing little or nothing to the solution – this is called network pruning. (ii) Commence with a very small network and allow it to grow to accommodate the learning problem – this is called network construction. As we shall see, the network construction methods lead to a variety of network structures, not simply the...
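The pruning route (approach (i)) can be sketched with the simplest common criterion, magnitude-based pruning: weights whose magnitudes are smallest are assumed to contribute little and are zeroed out. The lecture does not commit to a particular criterion, so this is a stand-in illustration with names of my choosing:

```python
import numpy as np

def magnitude_prune(W, frac):
    """Zero out the smallest-magnitude fraction `frac` of the weights in W.

    W:    weight matrix of a trained layer
    frac: fraction of weights to remove, in [0, 1)
    """
    flat = np.abs(W).ravel()
    k = int(frac * flat.size)      # number of weights to remove
    if k == 0:
        return W.copy()
    # Threshold at the k-th smallest magnitude; everything at or below it goes.
    thresh = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(W) <= thresh, 0.0, W)
```

In practice one would retrain (fine-tune) the network after pruning and possibly iterate, since removing weights shifts the optimum of the remaining ones.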