Results 1-10 of 27
Some PAC-Bayesian Theorems
Machine Learning, 1998
Cited by 102 (4 self)
This paper gives PAC guarantees for "Bayesian" algorithms, that is, algorithms that optimize risk minimization expressions involving a prior probability and a likelihood for the training data. PAC-Bayesian algorithms are motivated by a desire to provide an informative prior encoding information about the expected experimental setting but still having PAC performance guarantees over all IID settings. The PAC-Bayesian theorems given here apply to an arbitrary prior measure on an arbitrary concept space. These theorems provide an alternative to the use of VC dimension in proving PAC bounds for parameterized concepts.
1 INTRODUCTION
Much of modern learning theory can be divided into two seemingly separate areas: Bayesian inference and PAC learning. Both areas study learning algorithms which take as input training data and produce as output a concept or model which can then be tested on test data. In both areas learning algorithms are associated with correctness theorems. PAC correct...
PAC-Bayesian Model Averaging
In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999
Cited by 74 (2 self)
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Building on earlier methods for PAC-Bayesian model selection, this paper presents a method for PAC-Bayesian model averaging. The main result is a bound on generalization error of an arbitrary weighted mixture of concepts that depends on the empirical error of that mixture and the KL-divergence of the mixture from the prior. A simple characterization is also given for the error bound achieved by the optimal weighting.
PAC-Bayesian stochastic model selection
Machine Learning, 2003
Cited by 58 (2 self)
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a "posterior distribution" on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
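Note: the two quantities named in the abstract above (the training error of the stochastic classifier and the KL-divergence of the posterior from the prior) can be illustrated for a finite set of classifiers. The following is a minimal sketch, not the paper's construction: the function names, the inverse-temperature parameter `beta`, and the toy numbers are assumptions introduced here, and the paper's actual bound and its optimal temperature are not reproduced.

```python
import math

def gibbs_posterior(prior, train_errors, beta):
    """Return Q(h) proportional to prior(h) * exp(-beta * train_error(h)).

    `beta` is an assumed inverse-temperature parameter (hypothetical name).
    """
    weights = [p * math.exp(-beta * e) for p, e in zip(prior, train_errors)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(q, p):
    """KL(Q || P) for finite distributions; terms with q = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def stochastic_training_error(q, train_errors):
    """Expected training error when a classifier is sampled from q."""
    return sum(qi * e for qi, e in zip(q, train_errors))

# Toy example: three classifiers, uniform prior, made-up training errors.
prior = [1 / 3, 1 / 3, 1 / 3]
train_errors = [0.10, 0.20, 0.50]
posterior = gibbs_posterior(prior, train_errors, beta=10.0)

print("posterior:", posterior)
print("training error under posterior:",
      stochastic_training_error(posterior, train_errors))
print("KL(posterior || prior):", kl_divergence(posterior, prior))
```

Raising `beta` concentrates the posterior on low-error classifiers, which lowers the stochastic training error but increases the KL term; a guarantee stated in terms of both quantities has to balance the two.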
Adaptive model selection using empirical complexities
Annals of Statistics, 1999
Cited by 36 (8 self)
Key words and phrases. Complexity regularization, classification, pattern recognition, regression estimation, curve fitting, minimum description length.
Given n independent replicates of a jointly distributed pair (X, Y) ∈ R^d × R, we wish to select from a fixed sequence of model classes F_1, F_2, ... a deterministic prediction rule f: R^d → R whose risk is small. We investigate the possibility of empirically assessing the complexity of each model class, that is, the actual difficulty of the estimation problem within each class. The estimated complexities are in turn used to define an adaptive model selection procedure, which is based on complexity penalized empirical risk. The available data are divided into two parts. The first is used to form an empirical cover of each model class, and the second is used to select a candidate rule from each cover based on empirical risk. The covering radii are determined empirically to optimize a tight upper bound on the estimation error.
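Note: the final selection step described above, choosing among candidate rules by complexity-penalized empirical risk, can be sketched as follows. This is only an illustration of the generic penalized-selection step: the function name and the numbers are assumptions, and the paper's data-derived complexities (built from empirical covers on the first part of the data) are not reproduced here; the penalties are simply given as inputs.

```python
def select_model(empirical_risks, penalties):
    """Return the index k minimizing empirical_risks[k] + penalties[k].

    empirical_risks[k]: empirical risk of the candidate chosen from class k
    penalties[k]:       complexity penalty assigned to class k (assumed given)
    """
    if len(empirical_risks) != len(penalties):
        raise ValueError("need one penalty per candidate")
    scores = [r + c for r, c in zip(empirical_risks, penalties)]
    return min(range(len(scores)), key=scores.__getitem__)

# Toy example: richer classes fit better but pay a larger penalty.
risks = [0.30, 0.18, 0.15, 0.14]
penalties = [0.01, 0.03, 0.08, 0.20]
best = select_model(risks, penalties)
print("selected class index:", best)  # scores are 0.31, 0.21, 0.23, 0.34
```

The penalty keeps the procedure from always choosing the richest class, whose candidate has the lowest empirical risk but the least reliable risk estimate.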
Nonparametric time series prediction through adaptive model selection
Machine Learning, 2000
Cited by 28 (0 self)
We consider the problem of one-step-ahead prediction for time series generated by an underlying stationary stochastic process obeying the condition of absolute regularity, describing the mixing nature of the process. We make use of recent results from the theory of empirical processes, and adapt the uniform convergence framework of Vapnik and Chervonenkis to the problem of time series prediction, obtaining finite sample bounds. Furthermore, by allowing both the model complexity and memory size to be adaptively determined by the data, we derive nonparametric rates of convergence through an extension of the method of structural risk minimization suggested by Vapnik. All our results are derived for general L_p error measures, and apply to both exponentially and algebraically mixing processes.
Minimax-optimal classification with dyadic decision trees
IEEE Transactions on Information Theory, 2006
Cited by 27 (4 self)
Decision trees are among the most popular types of classifiers, with interpretability and ease of implementation being among their chief attributes. Despite the widespread use of decision trees, theoretical analysis of their performance has only begun to emerge in recent years. In this paper it is shown that a new family of decision trees, dyadic decision trees (DDTs), attain nearly optimal (in a minimax sense) rates of convergence for a broad range of classification problems. Furthermore, DDTs are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; and (3) reject irrelevant features. DDTs are constructed by penalized empirical risk minimization using a new data-dependent penalty and may be computed exactly with computational complexity that is nearly linear in the training sample size. DDTs are the first classifier known to achieve nearly optimal rates for the diverse class of distributions studied here while also being practical and implementable. This is also the first study (of which we are aware) to consider rates for adaptation to intrinsic data dimension and relevant features.
Learning minimum volume sets
J. Machine Learning Res., 2006
Cited by 27 (9 self)
Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules.
A Neyman-Pearson approach to statistical learning
IEEE Trans. Inform. Theory, 2005
Cited by 27 (8 self)
The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α > 0, the Neyman-Pearson lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed training examples from each hypothesis. Building on a “fundamental lemma” of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.
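Note: the idea of a constrained version of empirical risk minimization can be sketched for a finite family of threshold classifiers: minimize the empirical miss rate subject to the empirical false-alarm rate staying at or below α plus a slack. This is a hedged illustration only. The function name, the threshold family, the `tolerance` slack parameter, and the toy scores are assumptions introduced here, and the paper's exact constraint and algorithms are not reproduced.

```python
def np_erm(class0_scores, class1_scores, thresholds, alpha, tolerance):
    """Constrained ERM over threshold rules "predict class 1 if score > t".

    Returns (threshold, miss_rate, false_alarm_rate) for the feasible rule
    with the smallest empirical miss rate, or None if no rule satisfies
    false_alarm_rate <= alpha + tolerance. `tolerance` is an assumed slack.
    """
    best = None
    for t in thresholds:
        # Fraction of class-0 training examples wrongly labeled class 1.
        false_alarm = sum(s > t for s in class0_scores) / len(class0_scores)
        # Fraction of class-1 training examples wrongly labeled class 0.
        miss = sum(s <= t for s in class1_scores) / len(class1_scores)
        if false_alarm <= alpha + tolerance:
            if best is None or miss < best[1]:
                best = (t, miss, false_alarm)
    return best

# Toy example with made-up one-dimensional scores for each hypothesis.
class0 = [0.1, 0.2, 0.3, 0.4, 0.6]
class1 = [0.35, 0.5, 0.7, 0.8, 0.9]
result = np_erm(class0, class1, thresholds=[0.0, 0.25, 0.45, 0.65],
                alpha=0.2, tolerance=0.0)
print(result)
```

Unlike ordinary empirical risk minimization, the two error types are not averaged: one is capped and the other is minimized, which mirrors the asymmetry of the Neyman-Pearson setting.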