A Bayesian Committee Machine
- NEURAL COMPUTATION
, 2000
Cited by 60 (7 self)
The Bayesian committee machine (BCM) is a novel approach to combining estimators which were trained on different data sets. Although the BCM can be applied to the combination of any kind of estimators, the main foci are Gaussian process regression and related systems such as regularization networks and smoothing splines for which the degrees of freedom increase with the number of training data. Somewhat surprisingly, we find that the performance of the BCM improves if several test points are queried at the same time and is optimal if the number of test points is at least as large as the degrees of freedom of the estimator. The BCM also provides a new solution for online learning with potential applications to data mining. We apply the BCM to systems with fixed basis functions and discuss its relationship to Gaussian process regression. Finally, we also show how the ideas behind the BCM can be applied in a non-Bayesian setting to extend the input dependent combination of estimators.
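As a rough illustration of the combination step, the sketch below merges Gaussian predictions from several separately trained modules at a single test point, using the precision-weighted rule usually associated with the BCM. The abstract does not state this formula, so the scalar, single-query form and the helper name `bcm_combine` are assumptions made for illustration; the paper itself combines several query points jointly.

```python
def bcm_combine(means, variances, prior_variance):
    """Combine M Gaussian predictions (mean_i, var_i) for one test point.

    Hypothetical helper, scalar form of the combination rule:
        1/var = sum_i 1/var_i - (M - 1)/prior_variance
        mean  = var * sum_i mean_i / var_i
    """
    m = len(means)
    # Each module already includes the prior once, so M - 1 copies are removed.
    precision = sum(1.0 / v for v in variances) - (m - 1) / prior_variance
    if precision <= 0.0:
        raise ValueError("combined precision must be positive")
    variance = 1.0 / precision
    mean = variance * sum(mu / v for mu, v in zip(means, variances))
    return mean, variance


# Two modules that roughly agree; the combined variance is smaller than either.
mean, variance = bcm_combine([1.0, 1.2], [0.5, 0.5], prior_variance=2.0)
```

With a single module the rule returns that module's prediction unchanged, which is a quick sanity check on the prior correction term.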
Time Series Prediction by Using a Connectionist Network with Internal Delay Lines
- Time Series Prediction
, 1994
Cited by 55 (4 self)
A neural network architecture, which models synapses as Finite Impulse Response (FIR) linear filters, is discussed for use in time series prediction. Analysis and methodology are detailed in the context of the Santa Fe Institute Time Series Prediction Competition. Results of the competition show that the FIR network performed remarkably well on a chaotic laser intensity time series. 1 Introduction The goal of time series prediction or forecasting can be stated succinctly as follows: given a sequence $y(1), y(2), \ldots, y(N)$ up to time $N$, find the continuation $y(N+1), y(N+2), \ldots$ The series may arise from the sampling of a continuous time system, and be either stochastic or deterministic in origin. The standard prediction approach involves constructing an underlying model which gives rise to the observed sequence. In the oldest and most studied method, which dates back to Yule [1], a linear autoregression (AR) is fit to the data: $y(k) = \sum_{n=1}^{T} a(n)\, y(k-n) + e(k) = \hat{y}(k)$ + ...
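The linear autoregression mentioned in the excerpt predicts each value as a weighted sum of the previous T values. The sketch below shows only that baseline prediction step, not the FIR network; the function names `ar_predict` and `ar_continue` and the example coefficients are assumptions chosen for illustration, and fitting the coefficients is left out.

```python
def ar_predict(history, coeffs):
    """One-step AR prediction: y_hat(k) = sum_{n=1..T} a(n) * y(k - n)."""
    order = len(coeffs)
    if len(history) < order:
        raise ValueError("need at least T past values")
    # coeffs[0] is a(1) and multiplies the most recent value y(k - 1).
    return sum(a * history[-n] for n, a in enumerate(coeffs, start=1))


def ar_continue(history, coeffs, steps):
    """Iterate the one-step predictor to continue the series past time N."""
    series = list(history)
    for _ in range(steps):
        series.append(ar_predict(series, coeffs))
    return series[len(history):]


# Hypothetical AR(2) coefficients a(1) = 0.5, a(2) = 0.25.
continuation = ar_continue([1.0, 2.0], [0.5, 0.25], steps=2)
```

Feeding predictions back in as inputs is the simplest way to produce a multi-step continuation, at the cost of accumulating error with each step.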
Fast Exact Multiplication by the Hessian
- Neural Computation
, 1994
Cited by 54 (3 self)
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr)f(w + rv)|_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.
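The operator described above can be tried on a very small model. The sketch below applies R{} line by line to the gradient of a single tanh unit with squared error, which yields Hv without ever forming H. The model, the data layout, and the function names `gradient` and `hessian_vector` are assumptions for illustration, not the networks treated in the paper.

```python
import math


def gradient(w, xs, ts):
    """Gradient of E(w) = 0.5 * sum_p (tanh(w . x_p) - t_p)^2."""
    g = [0.0] * len(w)
    for x, t in zip(xs, ts):
        y = math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
        delta = (y - t) * (1.0 - y * y)  # dE/da for this pattern
        for i, xi in enumerate(x):
            g[i] += delta * xi
    return g


def hessian_vector(w, v, xs, ts):
    """Hv obtained by applying R{.} to each step of `gradient`."""
    hv = [0.0] * len(w)
    for x, t in zip(xs, ts):
        a = sum(wi * xi for wi, xi in zip(w, x))
        r_a = sum(vi * xi for vi, xi in zip(v, x))  # R{a} = v . x, since R{w} = v
        y = math.tanh(a)
        r_y = (1.0 - y * y) * r_a  # R{y} = tanh'(a) * R{a}
        # Product rule on delta = (y - t) * (1 - y^2)
        r_delta = r_y * (1.0 - y * y) - 2.0 * (y - t) * y * r_y
        for i, xi in enumerate(x):
            hv[i] += r_delta * xi
    return hv


xs = [[1.0, 0.5], [-0.3, 2.0]]
ts = [0.2, -0.4]
hv = hessian_vector([0.1, -0.2], [1.0, 0.5], xs, ts)
```

The second function mirrors the first almost statement for statement, which reflects the abstract's point that the product costs about as much as a gradient evaluation. A finite difference of `gradient` along v gives an approximate check.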
Statistical Inference, Occam’s Razor, and Statistical Mechanics on the Space of Probability Distributions
, 1997
Cited by 45 (2 self)
The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of low-temperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known results in the literature. In particular, I arrive at a precise understanding of how Occam’s razor, the principle that simpler models should be preferred until the data justify more complex models, is automatically embodied by probability theory. These results require a measure on the space of model parameters and I derive and discuss an interpretation of Jeffreys’ prior distribution as a uniform prior over the distributions indexed by a family. Finally, I derive a theoretical index of the complexity of a parametric family relative to some true distribution that I call the razor of the model. The form of the razor immediately suggests several interesting questions in the theory of learning that can be studied using the techniques of statistical mechanics.
An Empirical Investigation of Brute Force to choose Features, Smoothers and Function Approximators
- Computational Learning Theory and Natural Learning Systems
, 1992
Cited by 42 (9 self)
The generalization error of a function approximator, feature set or smoother can be estimated directly by the leave-one-out cross-validation error. For memory-based methods, this is computationally feasible. We describe an initial version of a general memory-based learning system (GMBL): a large collection of learners brought into a widely applicable machine-learning family. We present ongoing investigations into search algorithms which, given a dataset, find the family members and features that generalize best. We also describe GMBL's application to two noisy, difficult problems: predicting car engine emissions from pressure waves, and controlling a robot billiards player with redundant state variables. 1 Introduction The main engineering benefit of machine learning is its application to autonomous systems in which human decision making is minimized. Function approximation plays a large and successful role in this process. However, many other human decisions are needed even for si...
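Leave-one-out error is cheap for memory-based learners because removing a point requires no retraining. The sketch below shows the idea for k-nearest-neighbour regression on one-dimensional inputs and uses the error to choose k. It is not the GMBL system; the function names, the choice of k-NN, and the toy data are assumptions for illustration.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def loo_error(data, k):
    """Mean squared leave-one-out error of k-NN regression on `data`."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # "retraining" is just dropping one stored point
        total += (knn_predict(rest, x, k) - y) ** 2
    return total / len(data)


def best_k(data, candidates):
    """Pick the neighbourhood size with the lowest leave-one-out error."""
    return min(candidates, key=lambda k: loo_error(data, k))


data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
chosen = best_k(data, [1, 2])
```

The same loop could score feature subsets or smoother settings instead of k, which is the kind of brute-force search the abstract describes.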
On the Relationship Between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions
- NEURAL COMPUTATION
, 1996
Cited by 42 (6 self)
Feedforward networks are a class of regression techniques that can be used to learn to perform some task from a set of examples. The question of generalization of network performance from a finite training set to unseen data is clearly of crucial importance. In this article we first show that the generalization error can be decomposed in two terms: the approximation error, due to the insufficient representational capacity of a finite sized network, and the estimation error, due to insufficient information about the target function because of the finite number of samples. We then consider the problem of approximating functions belonging to certain Sobolev spaces with Gaussian Radial Basis Functions. Using the above mentioned decomposition we bound the generalization error in terms of the number of basis functions and number of examples. While the bound that we derive is specific for Radial Basis Functions, a number of observations deriving from it apply to any approximation t...
Discovering Neural Nets With Low Kolmogorov Complexity And High Generalization Capability
- Neural Networks
, 1997
Cited by 41 (23 self)
Many neural net learning algorithms aim at finding "simple" nets to explain training data. The expectation is: the "simpler" the networks, the better the generalization on test data (→ Occam's razor). Previous implementations, however, use measures for "simplicity" that lack the power, universality and elegance of those based on Kolmogorov complexity and Solomonoff's algorithmic probability. Likewise, most previous approaches (especially those of the "Bayesian" kind) suffer from the problem of choosing appropriate priors. This paper addresses both issues. It first reviews some basic concepts of algorithmic complexity theory relevant to machine learning, and how the Solomonoff-Levin distribution (or universal prior) deals with the prior problem. The universal prior leads to a probabilistic method for finding "algorithmically simple" problem solutions with high generalization capability. The method is based on Levin complexity (a time-bounded generalization of Kolmogorov comple...
A generalized approximate cross validation for smoothing splines with non-Gaussian data
- Statistica Sinica 6
, 1996
Cited by 39 (16 self)
In this paper, we propose a Generalized Approximate Cross Validation (GACV) function for estimating the smoothing parameter in the penalized log likelihood regression problem with non-Gaussian data. This GACV is obtained by, first, obtaining an approximation to the leaving-out-one function based on the negative log likelihood, and then, in a step reminiscent of that used to get from leaving-out-one cross validation to GCV in the Gaussian case, we replace diagonal elements of certain matrices by 1/n times the trace. A numerical simulation with Bernoulli data is used to compare the smoothing parameter λ chosen by this approximation procedure with the λ chosen from the two most often used algorithms based on the generalized cross validation procedure (O’Sullivan et al. (1986), Gu (1990, 1992)). In the examples here, the GACV estimate produces a better fit of the truth in terms of minimizing the Kullback-Leibler distance. Figures suggest that the GACV curve may be an approximately unbiased estimate of the Kullback-Leibler distance in the Bernoulli data case; however, a theoretical proof is yet to be found.
A New Metric-Based Approach to Model Selection
- In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97)
, 1997
Cited by 38 (6 self)
We introduce a new approach to model selection that performs better than the standard complexity-penalization and hold-out error estimation techniques in many cases. The basic idea is to exploit the intrinsic metric structure of a hypothesis space, as determined by the natural distribution of unlabeled training patterns, and use this metric as a reference to detect whether the empirical error estimates derived from a small (labeled) training sample can be trusted in the region around an empirically optimal hypothesis. Using simple metric intuitions we develop new geometric strategies for detecting overfitting and performing robust yet responsive model selection in spaces of candidate functions. These new metric-based strategies dramatically outperform previous approaches in experimental studies of classical polynomial curve fitting. Moreover, the technique is simple, efficient, and can be applied to most function learning tasks. The only requirement is access to an auxiliary collection ...
Finite Impulse Response Neural Networks for Autoregressive Time Series Prediction
, 1993
Cited by 37 (3 self)
A neural network architecture, which models synapses as Finite Impulse Response (FIR) linear filters, is discussed for use in time series prediction. Analysis and methodology are detailed in the context of the Santa Fe Institute Time Series Prediction Competition. Results of the competition show that the FIR network performed remarkably well on a chaotic laser intensity time series. 1 Introduction The goal of time series prediction or forecasting can be stated succinctly as follows: given a sequence $y(1), y(2), \ldots, y(N)$ up to time $N$, find the continuation $y(N+1), y(N+2), \ldots$ The series may arise from the sampling of a continuous time system, and be either stochastic or deterministic in origin. The standard prediction approach involves constructing an underlying model which gives rise to the observed sequence. In the oldest and most studied method, which dates back to Yule [1], a linear autoregression (AR) is fit to the data: $y(k) = \sum_{n=1}^{T} a(n)\, y(k-n) + e(k) = \hat{y}(k)$ + e...

