The Relationship between PAC, the Statistical Physics framework, the Bayesian framework, and the VC framework
- Cited by 37 (7 self)
This paper discusses the intimate relationships between the supervised learning frameworks mentioned in the title. In particular, it shows how all those frameworks can be viewed as particular instances of a single overarching formalism. In doing this, many commonly misunderstood aspects of those frameworks are explored. In addition, the strengths and weaknesses of those frameworks are compared, and some novel frameworks are suggested (resulting, for example, in a "correction" to the familiar bias-plus-variance formula).
Semiparametric Latent Factor Models
- Workshop on Artificial Intelligence and Statistics 10, 2005
- Cited by 34 (5 self)
We propose a semiparametric model for regression problems involving multiple response variables. The model makes use of a set of Gaussian processes that are linearly mixed to capture dependencies that may exist among the response variables. We propose an efficient approximate inference scheme for this semiparametric model whose complexity is linear in the number of training data points. We present experimental results in the domain of multi-joint
Unifying Divergence Minimization and Statistical Inference via Convex Duality
- Proc. of Conf. on Learning Theory (COLT), 2006
- Cited by 26 (9 self)
In this paper we unify divergence minimization and statistical inference by means of convex duality. In the process of doing so, we prove that the dual of approximate maximum entropy estimation is maximum a posteriori estimation. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems. Finally, we show how an algorithm by Zhang can be used to solve this class of optimization problems efficiently.
The supervised learning no-free-lunch Theorems
- In Proc. 6th Online World Conference on Soft Computing in Industrial Applications, 2001
- Cited by 22 (0 self)
This paper reviews the supervised learning versions of the no-free-lunch theorems in a simplified form. It also discusses the significance of those theorems, and their relation to other aspects of supervised learning.
Bayesian Classifiers are Large Margin Hyperplanes in a Hilbert Space
- Machine Learning: Proceedings of the Fifteenth International Conference
- Cited by 16 (9 self)
It is often claimed that one of the main distinctive features of Bayesian Learning Algorithms for neural networks is that they don't simply output one hypothesis, but rather an entire probability distribution over a hypothesis set: the Bayes posterior. An alternative perspective is that they output a linear combination of classifiers, whose coefficients are given by Bayes' theorem. This can be regarded as a hyperplane in a high-dimensional feature space. We provide a novel theoretical analysis of such classifiers, based on data-dependent VC theory, proving that they can be expected to be large margin hyperplanes in a Hilbert space, and hence to have low effective VC dimension. We also present an extensive experimental study confirming this prediction. This not only explains the remarkable resistance to overfitting exhibited by such classifiers, but also co-locates them in the same class as other systems, such as Support Vector Machines and AdaBoost, which have a similar performance. ...
Bayesian Methods for Neural Networks: Theory and Applications
- 1995
- Cited by 12 (0 self)
this document. Before these are discussed however, perhaps we should have a tutorial on Bayesian probability theory and its application to model comparison problems. 2 Probability theory and Occam's razor
Efficient Covariance Matrix Methods for Bayesian Gaussian Processes and Hopfield Neural Networks
- 1999
- Cited by 4 (0 self)
Covariance matrices are important in many areas of neural modelling. In Hopfield networks they are used to form the weight matrix which controls the autoassociative properties of the network. In Gaussian processes, which have been shown to be the infinite neuron limit of many regularised feedforward neural networks, covariance matrices control the form of the Bayesian prior distribution over function space. This thesis examines interesting modifications to the standard covariance matrix methods to increase the functionality or efficiency of these neural techniques. First, the problem of adapting Gaussian process priors to perform regression on switching regimes is tackled. This involves the use of block covariance matrices and Gibbs sampling methods. Then the use of Toeplitz methods is proposed for Gaussian process regression where sampling positions can be chosen. A comparison is made between Hopfield weight matrices and sample covariances. This allows work on sample covariances to be used ...
Bayesian Non-Linear Modelling with Neural Networks
- 1995
- Cited by 4 (0 self)
this paper is illustrated in figure 6e. If we give a probabilistic interpretation to the model, then we can evaluate the 'evidence' for alternative values of the control parameters. Over-complex models turn out to be less probable, and the quantity
Generalization Error and the Number of Hidden Units in a Multilayer Perceptron
- In preparation, 1995
- Cited by 3 (2 self)
Conventional wisdom states that as the number of hidden units H in a supervised regression network is increased, the generalization error, beyond a certain point, gets worse, so that the number of hidden units should be carefully controlled. However, Neal (1995) has shown theoretically that if an appropriate Gaussian prior is applied to the network weights, then as the number of hidden units is increased to infinity, the complexity of the probabilistic model does not increase. The distribution over functions tends to a Gaussian process whose properties are solely determined by three parameters of the Gaussian prior. Thus in an ideal Bayesian implementation, the number of hidden units should not be important. In this paper we reconcile these two apparently conflicting points of view. We emphasize the importance of the log predictive probability, which takes into account error bars on the network's predictions, as a generalization measure, in contrast to the traditional test error.
Bayesian kernel methods
- LNAI 2600, 2003
- Cited by 2 (0 self)
Bayesian methods allow for a simple and intuitive representation of the function spaces used by kernel methods. This chapter describes the basic principles of Gaussian Processes, their implementation and their connection to other kernel-based Bayesian estimation methods, such as the Relevance Vector Machine.

