Results 1–10 of 13
An introduction to collective intelligence
Handbook of Agent Technology, AAAI, 1999
The Relationship between PAC, the Statistical Physics framework, the Bayesian framework, and the VC framework
Abstract

Cited by 47 (9 self)
This paper discusses the intimate relationships between the supervised learning frameworks mentioned in the title. In particular, it shows how all those frameworks can be viewed as particular instances of a single overarching formalism. In doing this, many commonly misunderstood aspects of those frameworks are explored. In addition, the strengths and weaknesses of those frameworks are compared, and some novel frameworks are suggested (resulting, for example, in a "correction" to the familiar bias-plus-variance formula).
The supervised learning no-free-lunch theorems
In Proc. 6th Online World Conference on Soft Computing in Industrial Applications, 2001
Abstract

Cited by 46 (0 self)
This paper reviews the supervised learning versions of the no-free-lunch theorems in a simplified form. It also discusses the significance of those theorems and their relation to other aspects of supervised learning.
Linearly combining density estimators via stacking
Machine Learning, 1999
Abstract

Cited by 26 (4 self)
This paper presents experimental results with both real and artificial data on using the technique of stacking to combine unsupervised learning algorithms. Specifically, stacking is used to form a linear combination of finite mixture model and kernel density estimators for nonparametric multivariate density estimation. The method is found to outperform other strategies such as choosing the single best model based on cross-validation, combining with uniform weights, and even using the single best model chosen by “cheating” and examining the test set. We also investigate in detail how the utility of stacking changes when one of the models being combined generated the data; how the stacking coefficients of the models compare to the relative frequencies with which cross-validation chooses among the models; visualization of combined “effective” kernels; and the sensitivity of stacking to overfitting as model complexity increases. In an extended version of this paper we also investigate how stacking performs using L1 and L2 performance measures (for which one must know the true density) rather than log-likelihood (Smyth and Wolpert 1998).
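The stacking procedure this abstract describes can be sketched in a few lines. The sketch below is illustrative only: the synthetic data, the single train/held-out split (standing in for full cross-validation), the fixed kernel bandwidth, and the two-estimator setup (one Gaussian fit, one kernel density estimate) are assumptions made for the example, not the paper's actual experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=200)

# A single train / held-out split; the paper uses full cross-validation.
train, held_out = data[:150], data[150:]

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Estimator 1: a single Gaussian fit by maximum likelihood.
mu, sigma = train.mean(), train.std()
def parametric(x):
    return gaussian_pdf(np.atleast_1d(x), mu, sigma)

# Estimator 2: a fixed-bandwidth Gaussian kernel density estimate.
h = 0.5
def kde(x):
    x = np.atleast_1d(x)
    return gaussian_pdf(x[:, None], train[None, :], h).mean(axis=1)

# Stacking: pick the convex-combination weight w that maximizes the
# held-out log-likelihood of  w * parametric + (1 - w) * kde.
weights = np.linspace(0.0, 1.0, 101)
scores = [float(np.log(w * parametric(held_out)
                       + (1.0 - w) * kde(held_out)).sum())
          for w in weights]
w_star = weights[int(np.argmax(scores))]

def stacked(x):
    return w_star * parametric(x) + (1.0 - w_star) * kde(x)
```

Here the combination weight is found by a simple grid search over convex combinations; the paper's formulation allows more general linear combinations, with the coefficients fit by maximizing held-out log-likelihood.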
The Use of a Bayesian Neural Network Model for Classification Tasks
1997
Abstract

Cited by 25 (1 self)
This thesis deals with a Bayesian neural network model. The focus is on how to use the model for automatic classification, i.e. on how to train the neural network to classify objects from some domain, given a database of labeled examples from the domain. The original Bayesian neural network is a one-layer network implementing a naive Bayesian classifier. It is based on the assumption that different attributes of the objects appear independently of each other. This work has been aimed at extending the original Bayesian neural network model, mainly focusing on three different aspects. First, the model is extended to a multi-layer network, to relax the independence requirement. This is done by introducing a hidden layer of complex columns, groups of units which take input from the same set of input attributes. Two different types of complex column structures in the hidden layer are studied and compared. An information theoretic measure is used to decide which input attributes to consider toget...
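The one-layer naive Bayes network mentioned in this abstract can be sketched as follows; the tiny binary dataset and the Laplace smoothing are illustrative assumptions, not taken from the thesis. The point of the sketch is that under the attribute-independence assumption, classification reduces to a single linear layer whose weights are per-attribute log-odds.

```python
import numpy as np

# Hypothetical binary-attribute dataset: rows are objects, columns attributes.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([0, 0, 1, 1])  # class labels

classes = np.unique(y)
# Laplace-smoothed estimates of P(attribute = 1 | class).
p = np.array([(X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2)
              for c in classes])
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))

# The "one-layer network": connection weights are the log-odds of each
# attribute, and the class score is a weighted sum plus a bias, exactly
# as in a linear unit.  argmax over class scores gives the prediction.
W = np.log(p) - np.log(1 - p)              # weight matrix, one row per class
b = log_prior + np.log(1 - p).sum(axis=1)  # bias terms

def classify(x):
    return classes[np.argmax(W @ x + b)]
```

The identity used here is log P(x | c) = Σ_j x_j · log(p_cj / (1 − p_cj)) + Σ_j log(1 − p_cj), so the naive Bayes decision is exactly a linear threshold over the attributes.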
A Predictive Theory of Games
2006
Abstract

Cited by 2 (1 self)
Conventional non-cooperative game theory hypothesizes that the joint (mixed) strategy of a set of reasoning players in a game will necessarily satisfy an “equilibrium concept”. All other joint strategies are considered impossible. Moreover, often the number of joint strategies satisfying that equilibrium concept has measure zero. (Indeed, this is often considered a desirable property of an equilibrium concept.) Under this hypothesis the only issue is what equilibrium concept is “correct”. This hypothesis violates the first-principles arguments underlying probability theory. Indeed, probability theory renders moot the controversy over what equilibrium concept is correct — while in general there are joint (mixed) strategies with zero probability, in general the set {strategies with nonzero probability} has measure greater than zero. Rather than a first-principles derivation of an equilibrium concept, game theory requires a first-principles derivation of a distribution over joint strategies.
Any Two Learning Algorithms Are (Almost) Exactly Identical
2000
Abstract

Cited by 1 (0 self)
This paper shows that if one is provided with a loss function, it can be used in a natural way to specify a distance measure quantifying the similarity of any two supervised learning algorithms, even nonparametric algorithms. Intuitively, this measure gives the fraction of targets and training sets for which the expected performance of the two algorithms differs significantly. Bounds on the value of this distance are calculated for the case of binary outputs and 0-1 loss, indicating that any two learning algorithms are almost exactly identical for such scenarios. As an example, for any two algorithms A and B, even for small input spaces and training sets, for less than 2e of all targets will the difference between A's and B's generalization performance exceed 1%. In particular, this is true if B is bagging applied to A, or boosting applied to A. These bounds can be viewed alternatively as telling us, for example, that the simple English phrase "I expect that algorithm will generalize from the training set with an accuracy of at least 75% on the rest of the target" conveys 20,000 bytes of information concerning the target. The paper ends by discussing some of the subtleties of extending the distance measure to give a full (nonparametric) differential geometry of the manifold of learning algorithms.
Predictive Game Theory
2005
Abstract

Cited by 1 (1 self)
Conventional non-cooperative game theory hypothesizes that the joint (mixed) strategy of a set of reasoning players in a game will necessarily satisfy an “equilibrium concept”. The number of joint strategies satisfying that equilibrium concept has measure zero, and all other joint strategies are considered impossible. Under this hypothesis the only issue is what equilibrium concept is “correct”. This hypothesis violates the first-principles arguments underlying probability theory. Indeed, probability theory renders moot the controversy over what equilibrium concept is correct — in general all joint (mixed) strategies in a set with nonzero measure can arise with nonzero probability. Rather than a first-principles derivation of an equilibrium concept, game theory requires a first-principles derivation of a distribution over joint strategies. However say one wishes to predict a single joint strategy from that ...
Bayesian Integration of Rule Models
Abstract
Although Bayesian model averaging (BMA) is in principle the optimal method for combining learned models, it has received relatively little attention in the machine learning literature. This article describes an extensive empirical study of the application of BMA to rule induction. BMA is applied to a variety of tasks and compared with more ad hoc alternatives like bagging. In each case, BMA typically leads to higher error rates than the ad hoc alternative. This is found to be due to the exponential sensitivity of the likelihood to small variations in the sample, leading to effectively very little averaging being performed even when all models have similar error rates. Coupled with the generation of many models, this causes BMA to have a strong tendency to overfit. An attempt to combat this problem using carefully designed priors is described. These and further experiments suggest that methods like bagging succeed not because they approximate the optimal BMA procedure better than a sing...
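The "exponential sensitivity of the likelihood" this abstract points to can be illustrated in a few lines: even when candidate rules have nearly identical error rates, the posterior weights BMA assigns (under a uniform prior) concentrate almost entirely on the single best-scoring model, so very little averaging actually happens. The error counts and the simple Bernoulli likelihood below are hypothetical, chosen only to make the effect visible.

```python
import math

# Hypothetical error counts for three rules on an n-example sample.
n = 100
errors = [10, 11, 12]  # nearly identical error rates: 10%, 11%, 12%

# A simple Bernoulli likelihood of the sample labels given each rule,
# using the rule's observed accuracy p: p^(correct) * (1-p)^(wrong).
def log_likelihood(e):
    p = 1 - e / n
    return (n - e) * math.log(p) + e * math.log(1 - p)

# BMA weights under a uniform prior are proportional to the likelihood.
# Subtracting the max log-likelihood before exponentiating avoids underflow.
lls = [log_likelihood(e) for e in errors]
m = max(lls)
weights = [math.exp(ll - m) for ll in lls]
total = sum(weights)
weights = [w / total for w in weights]
```

Despite the three rules differing by only one or two errors in a hundred, the first rule ends up carrying the large majority of the posterior weight, which is the concentration effect the article blames for BMA's tendency to overfit.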