Results 1  10
of
81
Sparse multinomial logistic regression: fast algorithms and generalization bounds
 IEEE Trans. on Pattern Analysis and Machine Intelligence
"... Abstract—Recently developed methods for learning sparse classifiers are among the stateoftheart in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsitypromoting priors encouraging the weight estimates to be either significantly larg ..."
Abstract

Cited by 113 (1 self)
 Add to MetaCart
Abstract—Recently developed methods for learning sparse classifiers are among the stateoftheart in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsitypromoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learningtheoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a componentwise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in highdimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsitypromoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
PACBayesian Model Averaging
 In Proceedings of the Twelfth Annual Conference on Computational Learning Theory
, 1999
"... PACBayesian learning methods combine the informative priors of Bayesian methods with distributionfree PAC guarantees. Building on earlier methods for PACBayesian model selection, this paper presents a method for PACBayesian model averaging. The main result is a bound on generalization error of a ..."
Abstract

Cited by 75 (2 self)
 Add to MetaCart
PACBayesian learning methods combine the informative priors of Bayesian methods with distributionfree PAC guarantees. Building on earlier methods for PACBayesian model selection, this paper presents a method for PACBayesian model averaging. The main result is a bound on generalization error of an arbitrary weighted mixture of concepts that depends on the empirical error of that mixture and the KLdivergence of the mixture from the prior. A simple characterization is also given for the error bound achieved by the optimal weighting. 1
Online Bayes Point Machines
"... We present a new and simple algorithm for learning large margin classi ers that works in a truly online manner. The algorithm generates a linear classi er by averaging the weights associated with several perceptronlike algorithms run in parallel in order to approximate the Bayes point. A rand ..."
Abstract

Cited by 69 (3 self)
 Add to MetaCart
We present a new and simple algorithm for learning large margin classi ers that works in a truly online manner. The algorithm generates a linear classi er by averaging the weights associated with several perceptronlike algorithms run in parallel in order to approximate the Bayes point. A random subsample of the incoming data stream is used to ensure diversity in the perceptron solutions. We experimentally study the algorithm's performance on online and batch learning settings.
PACBayesian stochastic model selection
 Machine Learning
, 2003
"... Abstract PACBayesian learning methods combine the informative priors of Bayesian methods with distributionfree PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a "posterior distribution " on classifiers. This paper giv ..."
Abstract

Cited by 60 (2 self)
 Add to MetaCart
Abstract PACBayesian learning methods combine the informative priors of Bayesian methods with distributionfree PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a "posterior distribution " on classifiers. This paper gives a PACBayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KLdivergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
PACBayesian generalization error bounds for Gaussian process classification
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2002
"... ..."
A PACBayesian Margin Bound for Linear Classifiers
, 2002
"... We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PACBayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential imp ..."
Abstract

Cited by 30 (3 self)
 Add to MetaCart
We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PACBayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to nontrivial bound values andfor maximum marginsto a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the wellknown support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.
Generalization Error Bounds for Bayesian Mixture Algorithms
 Journal of Machine Learning Research
, 2003
"... Bayesian approaches to learning and estimation have played a significant role in the Statistics literature over many years. While they are often provably optimal in a frequentist setting, and lead to excellent performance in practical applications, there have not been many precise characterizations ..."
Abstract

Cited by 27 (5 self)
 Add to MetaCart
Bayesian approaches to learning and estimation have played a significant role in the Statistics literature over many years. While they are often provably optimal in a frequentist setting, and lead to excellent performance in practical applications, there have not been many precise characterizations of their performance for finite sample sizes under general conditions. In this paper we consider the class of Bayesian mixture algorithms, where an estimator is formed by constructing a datadependent mixture over some hypothesis space. Similarly to what is observed in practice, our results demonstrate that mixture approaches are particularly robust, and allow for the construction of highly complex estimators, while avoiding undesirable overfitting effects. Our results, while being datadependent in nature, are insensitive to the underlying model assumptions, and apply whether or not these hold. At a technical level, the approach applies to unbounded functions, constrained only by certain moment conditions. Finally, the bounds derived can be directly applied to nonBayesian mixture approaches such as Boosting and Bagging. 1.
Generalization Bounds for Decision Trees
, 2000
"... We derive a new bound on the error rate for decision trees. The bound depends both on the structure of the tree and the specific sample (not just the size of the sample). This bound is tighter than traditional bounds for unbalanced trees and justifies "compositional" algorithms for constructing deci ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
We derive a new bound on the error rate for decision trees. The bound depends both on the structure of the tree and the specific sample (not just the size of the sample). This bound is tighter than traditional bounds for unbalanced trees and justifies "compositional" algorithms for constructing decision trees. 1 Introduction The problem of overfitting is central to both the theory and practice of machine learning. Intuitively, one overfits by using too many parameters in the concept, e.g, fitting an nth order polynomial to n data points. One underfits by using too few parameters, e.g., fitting a linear curve to clearly quadratic data. The fundamental question is how many parameters, or what concept size, should one allow for a given amount of training data. A standard theoretical approach is to prove a bound on generalization error as a function of the training error and the concept size (or VC dimension). One can then select a concept minimizing this bound, i.e., optimizing a cert...
Self bounding learning algorithms
 In Proceedings of the Eleventh Annual Conference on Computational Learning Theory
, 1998
"... Most of the work which attempts to give bounds on the generalization error of the hypothesis generated by a learning algorithm is based on methods from the theory of uniform convergence. These bounds are apriori bounds that hold for any distribution of examples and are calculated before any data is ..."
Abstract

Cited by 23 (0 self)
 Add to MetaCart
Most of the work which attempts to give bounds on the generalization error of the hypothesis generated by a learning algorithm is based on methods from the theory of uniform convergence. These bounds are apriori bounds that hold for any distribution of examples and are calculated before any data is observed. In this paper we propose a different approach for bounding the generalization error after the data has been observed. A selfbounding learning algorithm is an algorithm which, in addition to the hypothesis that it outputs, outputs a reliable upper bound on the generalization error of this hypothesis. We first explore the idea in the statistical query learning framework of Kearns [10]. After that we give an explicit self bounding algorithm for learning algorithms that are based on local search. 1
Explicit learning curves for transduction and application to clustering and compression algorithms
 Journal of Artificial Intelligence Research
, 2004
"... Inductive learning is based on inferring a general rule from a finite data set and using it to label new data. In transduction one attempts to solve the problem of using a labeled training set to label a set of unlabeled points, which are given to the learner prior to learning. Although transduction ..."
Abstract

Cited by 22 (3 self)
 Add to MetaCart
Inductive learning is based on inferring a general rule from a finite data set and using it to label new data. In transduction one attempts to solve the problem of using a labeled training set to label a set of unlabeled points, which are given to the learner prior to learning. Although transduction seems at the outset to be an easier task than induction, there have not been many provably useful algorithms for transduction. Moreover, the precise relation between induction and transduction has not yet been determined. The main theoretical developments related to transduction were presented by Vapnik more than twenty years ago. One of Vapnik’s basic results is a rather tight error bound for transductive classification based on an exact computation of the hypergeometric tail. While being tight, this bound is given implicitly via a computational routine. Our first contribution is a somewhat looser but explicit characterization of a slightly extended PACBayesian version of Vapnik’s transductive bound. This characterization is obtained using concentration inequalities for the tail of sums of random variables obtained by sampling without replacement. We then derive error bounds for compression schemes such as (transductive) support vector machines and for transduction algorithms based on clustering. The main observation used for deriving these new error bounds and algorithms is that the unlabeled test points, which in the transductive setting are known in advance, can be used in order to construct useful data dependent prior distributions over the hypothesis space. 1.