Results 1–10 of 155
Empirical margin distributions and bounding the generalization error of combined classifiers
Ann. Statist., 2002
Cited by 112 (8 self)
Dedicated to A.V. Skorohod on his seventieth birthday. We prove new probabilistic upper bounds on the generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such as boosting and bagging. The bounds are in terms of the empirical distribution of the margin of the combined classifier. They are based on the methods of the theory of Gaussian and empirical processes (comparison inequalities, symmetrization method, concentration inequalities), and they improve previous results of Bartlett (1998) on bounding the generalization error of neural networks in terms of ℓ1-norms of the weights of neurons and of Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error of boosting. We also obtain rates of convergence in Lévy distance of the empirical margin distribution to the true margin distribution, uniformly over the classes of classifiers, and prove the optimality of these rates.
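The central object here, the empirical margin distribution of a voting classifier, is easy to compute. A minimal sketch, using synthetic ±1 base classifiers and random convex weights (all data and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: n labeled points, T simple base classifiers h_j taking values in {-1, +1}
n, T = 200, 10
y = rng.choice([-1, 1], size=n)
H = rng.choice([-1, 1], size=(T, n))      # H[j, i] = h_j(x_i)
w = rng.random(T)
w /= w.sum()                              # convex (voting) weights

f = w @ H                                 # combined classifier, so |f| <= 1
margins = y * f                           # margin of each example; negative means misclassified

def empirical_margin_cdf(margins, t):
    """Fraction of examples whose margin is at most t."""
    return np.mean(margins <= t)

# Margin-based bounds trade this CDF at a threshold delta > 0 against a
# complexity term that shrinks as delta and the sample size grow.
for delta in (0.0, 0.1, 0.5):
    print(delta, empirical_margin_cdf(margins, delta))
```

The bounds in the paper control the true error by this empirical CDF at a margin level delta plus a complexity penalty; the sketch only shows the empirical quantity itself.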
Large Sample Sieve Estimation of Semi-Nonparametric Models
Handbook of Econometrics, 2007
Cited by 92 (17 self)
Often researchers find parametric models restrictive and sensitive to deviations from the parametric specifications; semi-nonparametric models are more flexible and robust, but lead to other complications, such as introducing infinite-dimensional parameter spaces that may not be compact. The method of sieves provides one way to tackle such complexities by optimizing an empirical criterion function over a sequence of approximating parameter spaces, called sieves, which are significantly less complex than the original parameter space. With different choices of criteria and sieves, the method of sieves is very flexible in estimating complicated econometric models. For example, it can simultaneously estimate the parametric and nonparametric components in semi-nonparametric models with or without constraints. It can easily incorporate prior information, often derived from economic theory, such as monotonicity, convexity, additivity, multiplicity, exclusion and nonnegativity. This chapter describes estimation of semi-nonparametric econometric models via the method of sieves. We present some general results on the large sample properties of the sieve estimates, including consistency of the sieve extremum estimates, convergence rates of the sieve M-estimates, pointwise normality of series estimates of regression functions, and root-n asymptotic normality and efficiency of sieve estimates of smooth functionals of infinite-dimensional parameters. Examples are used to illustrate the general results.
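The simplest instance of the method is sieve least squares with a polynomial sieve: the unknown regression function is estimated over finite-dimensional subspaces of growing dimension. A toy sketch (the target function, sample size, and sieve dimensions below are all illustrative choices, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sieve least squares: regress y on x over growing polynomial sieves.
n = 400
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)   # unknown g(x) = sin(pi*x)

def sieve_fit(x, y, k):
    """Least-squares fit over the k-dimensional sieve span{1, x, ..., x^(k-1)}."""
    B = np.vander(x, k, increasing=True)               # basis matrix, shape (n, k)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef

def sieve_predict(coef, x):
    return np.vander(x, len(coef), increasing=True) @ coef

# In the theory, the sieve dimension k grows slowly with n, trading
# approximation error against estimation error; here we compare two fixed sieves.
for k in (2, 6):
    resid = y - sieve_predict(sieve_fit(x, y, k), x)
    print(k, np.mean(resid**2))
```

A richer sieve always fits the sample at least as well; the convergence-rate results in the chapter concern how fast k may grow with n before estimation error dominates.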
Multiscale likelihood analysis and complexity penalized estimation
 Annals of Statistics
Cited by 57 (20 self)
We describe here a framework for a certain class of multiscale likelihood factorizations wherein, in analogy to a wavelet decomposition of an L² function, a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale. The framework is developed as a set of sufficient conditions for the existence of such factorizations, formulated in analogy to those underlying a standard multiresolution analysis for wavelets, and hence can be viewed as a multiresolution analysis for likelihoods. We then consider the use of these factorizations in the task of nonparametric, complexity-penalized likelihood estimation. We study the risk properties of certain thresholding and partitioning estimators, and demonstrate their adaptivity and near-optimality, in a minimax sense over a broad range of function spaces, based on squared Hellinger distance as a loss function. In particular, our results provide an illustration of how properties of classical wavelet-based estimators can be obtained in a single, unified framework.
Near-ideal model selection by ℓ1 minimization
2008
Cited by 45 (2 self)
We consider the fundamental problem of estimating the mean of a vector y = Xβ + z, where X is an n × p design matrix in which one can have far more variables than observations and z is a stochastic error term, the so-called ‘p > n’ setup. When β is sparse, or more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean vector, we ask whether or not it is possible to accurately estimate Xβ using a computationally tractable algorithm. We show that in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error one would achieve with an oracle supplying perfect information about which variables should be included in the model and which should not. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in the vast majority of cases where Xβ is a sparse or nearly sparse superposition of variables, but not in all cases. Our results are nonasymptotic and widely applicable, since they simply require that pairs of predictor variables not be too collinear.
Risk bounds for Statistical Learning
Cited by 40 (1 self)
We propose a general theorem providing upper bounds for the risk of an empirical risk minimizer (ERM). We essentially focus on the binary classification framework. We extend Tsybakov’s analysis of the risk of an ERM under margin-type conditions by using concentration inequalities for conveniently weighted empirical processes. This allows us to deal with ways of measuring the “size” of a class of classifiers other than entropy with bracketing, as in Tsybakov’s work. In particular, we derive new risk bounds for the ERM when the classification rules belong to some VC-class under margin conditions, and discuss the optimality of those bounds in a minimax sense.
Rademacher Processes And Bounding The Risk Of Function Learning
High Dimensional Probability II, 1999
Cited by 39 (6 self)
We construct data-dependent upper bounds on the risk in function learning problems. The bounds are based on the local norms of the Rademacher process indexed by the underlying function class, and they do not require prior knowledge about the distribution of training examples or any specific properties of the function class. Using Talagrand-type concentration inequalities for empirical and Rademacher processes, we show that the bounds hold with high probability, the failure probability decreasing exponentially fast as the sample size grows. In typical situations that are frequently encountered in the theory of function learning, the bounds give a nearly optimal rate of convergence of the risk to zero.

1. Local Rademacher norms and bounds on the risk: main results. Let (S, 𝒜) be a measurable space and let F be a class of 𝒜-measurable functions from S into [0, 1]. Denote by P(S) the set of all probability measures on (S, 𝒜). Let f₀ ∈ F be an unknown target function. Given a probability measure P ∈ P(S) (also unknown), let (X₁, …, X_n) be an i.i.d. sample in (S, 𝒜) with common distribution P (defined on a probability space (Ω, Σ, ℙ)). In computer learning theory, the problem of estimating f₀ based on the labeled sample (X₁, Y₁), …, (X_n, Y_n), where Y_j := f₀(X_j), j = 1, …, n, is referred to as the function learning problem. So-called concept learning is a special case of function learning; in this case F := {I_C : C ∈ C}, where C ⊆ 𝒜 is called a class of concepts (see Vapnik (1998), Vidyasagar (1996), and Devroye, Györfi and Lugosi (1996) for accounts of statistical learning theory). The goal of function learning is to find an estimate
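For a finite class represented by its values on the sample, the (global) empirical Rademacher average E_σ sup_f (1/n) Σ σ_i f(X_i) can be estimated by Monte Carlo. The paper works with localized versions of this quantity; the sketch below only shows the global average, and the class, sizes, and number of sign draws are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical finite class: row j of F holds (f_j(X_1), ..., f_j(X_n)).
n, n_funcs = 100, 20
F = rng.random((n_funcs, n))      # function values in [0, 1]

def empirical_rademacher(F, n_mc=2000, rng=rng):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i * f(X_i)."""
    n = F.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))   # i.i.d. Rademacher signs
    sup = np.max(sigma @ F.T / n, axis=1)             # sup over f for each sign draw
    return sup.mean()

print(empirical_rademacher(F))
```

Since the quantity is data-dependent, it can be computed from the sample alone, which is exactly what makes bounds of this type usable without prior knowledge of P.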
Adaptive Regression by Mixing
Journal of the American Statistical Association
Cited by 39 (7 self)
Adaptation over different procedures is of practical importance, since different procedures perform well under different conditions. In many practical situations it is rather hard to assess which conditions are (approximately) satisfied so as to identify the best procedure for the data at hand, so automatic adaptation over various scenarios is desirable. A practically feasible method, named Adaptive Regression by Mixing (ARM), is proposed to convexly combine general candidate regression procedures. Under mild conditions, the resulting estimator is theoretically shown to perform optimally in rates of convergence without knowing which of the original procedures works best. Simulations are conducted in several settings, including comparing a parametric model with nonparametric alternatives, comparing a neural network with projection pursuit in multidimensional regression, and combining bandwidths in kernel regression. The results clearly support the theoretical properties of ARM. The ARM …
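The core idea, fit candidate procedures on one part of the data and convexly combine them with weights computed on the other part, can be sketched in a few lines. This is only loosely in the spirit of ARM (whose weights are likelihood-based); here the weights are exponential in held-out squared error, and the candidates, split, and constants are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy mixing of two regression procedures via data splitting.
n = 200
x = rng.uniform(0, 1, n)
y = 2.0 * x + 0.2 * rng.standard_normal(n)

train, hold = np.arange(0, n, 2), np.arange(1, n, 2)   # even/odd split

# Candidate 1: linear least squares; candidate 2: constant (mean) predictor.
a, b = np.polyfit(x[train], y[train], 1)
def pred1(t):
    return a * t + b

c = y[train].mean()
def pred2(t):
    return np.full_like(t, c)

# Exponential weights from held-out squared-error losses, normalized to be convex.
losses = np.array([np.sum((y[hold] - p(x[hold]))**2) for p in (pred1, pred2)])
w = np.exp(-0.5 * (losses - losses.min()))
w /= w.sum()

def mix(t):
    return w[0] * pred1(t) + w[1] * pred2(t)

print(w)
```

Because the weights are convex, the combined predictor can never stray outside the range spanned by the candidates, and the better candidate dominates the mixture as its held-out loss advantage grows.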
Adaptive model selection using empirical complexities
Annals of Statistics, 1999
Cited by 36 (8 self)
Key words and phrases: complexity regularization, classification, pattern recognition, regression estimation, curve fitting, minimum description length. Given n independent replicates of a jointly distributed pair (X, Y) ∈ R^d × R, we wish to select from a fixed sequence of model classes F₁, F₂, … a deterministic prediction rule f : R^d → R whose risk is small. We investigate the possibility of empirically assessing the complexity of each model class, that is, the actual difficulty of the estimation problem within each class. The estimated complexities are in turn used to define an adaptive model selection procedure, which is based on complexity-penalized empirical risk. The available data are divided into two parts. The first is used to form an empirical cover of each model class, and the second is used to select a candidate rule from each cover based on empirical risk. The covering radii are determined empirically to optimize a tight upper bound on the estimation error.
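The two-part scheme, fit a candidate rule from each class on one half of the data, then pick the class minimizing empirical risk plus a complexity penalty on the other half, can be illustrated with nested polynomial classes. This sketch replaces the paper's empirical covers with a simple dimension-based penalty, and every constant below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy complexity-penalized selection among three model classes via data splitting.
n = 300
x = rng.uniform(0, 1, n)
y = (x > 0.5).astype(float) + 0.3 * rng.standard_normal(n)   # step function + noise

half = n // 2
x1, y1, x2, y2 = x[:half], y[:half], x[half:], y[half:]

def fit_poly(k):
    """Candidate rule from the k-parameter polynomial class, fit on the first half."""
    coef = np.polyfit(x1, y1, k - 1)
    return lambda t: np.polyval(coef, t)

classes = {k: fit_poly(k) for k in (1, 2, 6)}    # three nested model classes

def penalized_risk(k, f, c=0.5):
    """Empirical risk on the second half plus a complexity penalty ~ k / (n/2)."""
    return np.mean((y2 - f(x2))**2) + c * k / half

best_k = min(classes, key=lambda k: penalized_risk(k, classes[k]))
print(best_k)
```

The penalty keeps the richest class from winning on held-out noise alone; the paper's contribution is making that penalty data-driven via empirical covers rather than a fixed function of dimension.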