Results 11  20
of
112
Rigorous learning curve bounds from statistical mechanics
 Machine Learning
, 1994
"... Abstract In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the wellestablished VapnikChervonenkis theory is that our bounds can be considerably tighter in many cases, an ..."
Abstract

Cited by 53 (9 self)
 Add to MetaCart
Abstract In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the wellestablished VapnikChervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior (functional form) of learning curves. This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory. The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes. We illustrate our results with many concrete examples of learning curve bounds derived from our theory. 1 Introduction According to the VapnikChervonenkis (VC) theory of learning curves [27, 26], minimizing empirical error within a function class F on a random sample of m examples leads to generalization error bounded by ~O(d=m) (in the case that the target function is contained in F) or ~O(pd=m) plus the optimal generalization error achievable within F (in the general case). 1 These bounds are universal: they hold for any class of hypothesis functions F, for any input distribution, and for any target function. The only problemspecific quantity remaining in these bounds is the VC dimension d, a measure of the complexity of the function class F. It has been shown that these bounds are essentially the best distributionindependent bounds possible, in the sense that for any function class, there exists an input distribution for which matching lower bounds on the generalization error can be given [5, 7, 22].
A few notes on Statistical Learning Theory
, 2003
"... this article is on the theoretical side and not on the applicative one; hence, we shall not present examples which may be interesting from the practical point of view but have little theoretical significance. This survey is far from being complete and it focuses on problems the author finds interest ..."
Abstract

Cited by 52 (10 self)
 Add to MetaCart
this article is on the theoretical side and not on the applicative one; hence, we shall not present examples which may be interesting from the practical point of view but have little theoretical significance. This survey is far from being complete and it focuses on problems the author finds interesting (an opinion which is not necessarily shared by the majority of the learning community). Relevant books which present a more evenly balanced approach are, for example [1, 4, 35, 36] The starting point of our discussion is the formulation of the learning problem. Consider a class G, consisting of real valued functions defined on a space #, and assume that each g G maps # into [0, 1]. Let T be an unknown function, T : # [0, 1] and set to be an unknown probability measure on #
On the Rate of Convergence of Regularized Boosting Classifiers
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
"... A regularized boosting method is introduced, for which regularization is obtained through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that for quite ..."
Abstract

Cited by 46 (10 self)
 Add to MetaCart
A regularized boosting method is introduced, for which regularization is obtained through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that for quite a large class of distributions, the probability of error converges to the Bayes risk at a rate faster than n (V+2)/(4(V+1)) where V is the VC dimension of the "base" class whose elements are combined by boosting methods to obtain an aggregated classifier. The dimensionindependent nature of the rates may partially explain the good behavior of these methods in practical problems. Under Tsybakov's noise condition the rate of convergence is even faster. We investigate the conditions necessary to obtain such rates for different base classes. The special case of boosting using decision stumps is studied in detail. We characterize the class of classifiers realizable by aggregating decision stumps.
Learning and Value Function Approximation in Complex Decision Processes
, 1998
"... In principle, a wide variety of sequential decision problems  ranging from dynamic resource allocation in telecommunication networks to financial risk management  can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and sto ..."
Abstract

Cited by 36 (4 self)
 Add to MetaCart
In principle, a wide variety of sequential decision problems  ranging from dynamic resource allocation in telecommunication networks to financial risk management  can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortunately, exact computation of the value function typically requires time and storage that grow proportionately with the number of states, and consequently, the enormous state spaces that arise in practical applications render the algorithms intractable. In this thesis, we study tractable methods that approximate the value function. Our work builds on research in an area of artificial intelligence known as reinforcement learning. A point of focus of this thesis is temporaldifference learning  a stochastic algorithm inspired to some extent by phenomena observed in animal behavior. Given a selection of...
Adaptive model selection using empirical complexities
 Annals of Statistics
, 1999
"... Key words and phrases. Complexity regularization, classi cation, pattern recognition, regression estimation, curve tting, minimum description length. 1 Given n independent replicates of a jointly distributed pair (X; Y) 2R d R, we wish to select from a xed sequence of model classes F1; F2;:::a deter ..."
Abstract

Cited by 36 (8 self)
 Add to MetaCart
Key words and phrases. Complexity regularization, classi cation, pattern recognition, regression estimation, curve tting, minimum description length. 1 Given n independent replicates of a jointly distributed pair (X; Y) 2R d R, we wish to select from a xed sequence of model classes F1; F2;:::a deterministic prediction rule f: R d! R whose risk is small. We investigate the possibility of empirically assessing the complexity of each model class, that is, the actual di culty of the estimation problem within each class. The estimated complexities are in turn used to de ne an adaptive model selection procedure, which is based on complexity penalized empirical risk. The available data are divided into two parts. The rst is used to form an empirical cover of each model class, and the second is used to select a candidate rule from each cover based on empirical risk. The covering radii are determined empirically to optimize a tight upper bound on the estimation error.
Clustering for EdgeCost Minimization
"... Leonard J. Schulman College of Computing Georgia Institute of Technology Atlanta GA 303320280 ABSTRACT We address the problem of partitioning a set of n points into clusters, so as to minimize the sum, over all intracluster pairs of points, of the cost associated with each pair. We obtain a ra ..."
Abstract

Cited by 32 (5 self)
 Add to MetaCart
Leonard J. Schulman College of Computing Georgia Institute of Technology Atlanta GA 303320280 ABSTRACT We address the problem of partitioning a set of n points into clusters, so as to minimize the sum, over all intracluster pairs of points, of the cost associated with each pair. We obtain a randomized approximation algorithm for this problem, for the cost functions ` 2 2 ; `1 and `2 , as well as any cost function isometrically embeddable in ` 2 2 .
Piecewisepolynomial regression trees
 Statistica Sinica
, 1994
"... A nonparametric function 1 estimation method called SUPPORT (“Smoothed and Unsmoothed PiecewisePolynomial Regression Trees”) is described. The estimate is typically made up of several pieces, each piece being obtained by fitting a polynomial regression to the observations in a subregion of the data ..."
Abstract

Cited by 30 (7 self)
 Add to MetaCart
A nonparametric function 1 estimation method called SUPPORT (“Smoothed and Unsmoothed PiecewisePolynomial Regression Trees”) is described. The estimate is typically made up of several pieces, each piece being obtained by fitting a polynomial regression to the observations in a subregion of the data space. Partitioning is carried out recursively as in a treestructured method. If the estimate is required to be smooth, the polynomial pieces may be glued together by means of weighted averaging. The smoothed estimate is thus obtained in three steps. In the first step, the regressor space is recursively partitioned until the data in each piece are adequately fitted by a polynomial of a fixed order. Partitioning is guided by analysis of the distributions of residuals and crossvalidation estimates of prediction mean square error. In the second step, the data within a neighborhood of each partition are fitted by a polynomial. The final estimate of the regression function is obtained by averaging the polynomial pieces, using smooth weight functions each of which diminishes rapidly to zero outside its associated partition. Estimates of derivatives of the regression function may be
Active learning using arbitrary binary valued queries
 Machine Learning
, 1993
"... Abstract. The original and most widely studied PAC model for learning assumes a passive learner in the sense that the learner plays no role in obtaining information about the unknown concept. That is, the samples are simply drawn independently from some probability distribution. Some work has been d ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
Abstract. The original and most widely studied PAC model for learning assumes a passive learner in the sense that the learner plays no role in obtaining information about the unknown concept. That is, the samples are simply drawn independently from some probability distribution. Some work has been done on studying more powerful oracles and how they affect learnability. To find bounds on the improvement in sample complexity that can be expected from using oracles, we consider active learning in the sense that the learner has complete control over the information received. Specifically, we allow the learner to ask arbitrary yes/no questions. We consider both active learning under a fixed distribution and distributionfree active learning. In the case of active learning, the underlying probability distribution is used only to measure distance between concepts. For learnability with respect to a fixed distribution, active learning does not enlarge the set of learnable concept classes, but can improve the sample complexity. For distributionfree learning, it is shown that a concept class is actively learnable iff it is finite, so that active learning is in fact less powerful than the usual passive learning model. We also consider a form of distributionfree learning in which the learner knows the distribution being used, so that "distributionfree" refers only to the requirement that a bound on the number of queries can be obtained uniformly over all distributions. Even with the side information of the distribution being used, a concept class is actively learnable iff it has finite VC dimension, so that active learning with the side information still does not enlarge the set of learnable concept classes. Keywords: PAClearning, active learning, queries, oracles 1.
Bounds for the uniform deviation of empirical measures
 Journal of Multivariate Analysis
, 1982
"... If x,)...) X, are independent identically distributed Rdvalued random vectors with probability measure p and empirical probability measure p,, and if QZ is a subset of the Bore1 sets on Rd, then we show that P{sup,, ~ ..."
Abstract

Cited by 25 (4 self)
 Add to MetaCart
If x,)...) X, are independent identically distributed Rdvalued random vectors with probability measure p and empirical probability measure p,, and if QZ is a subset of the Bore1 sets on Rd, then we show that P{sup,, ~