Results 1 - 10 of 53
Adaptive model selection using empirical complexities
- Annals of Statistics, 1999
Cited by 34 (8 self)
Key words and phrases: complexity regularization, classification, pattern recognition, regression estimation, curve fitting, minimum description length. Given n independent replicates of a jointly distributed pair $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$, we wish to select from a fixed sequence of model classes $\mathcal{F}_1, \mathcal{F}_2, \ldots$ a deterministic prediction rule $f: \mathbb{R}^d \to \mathbb{R}$ whose risk is small. We investigate the possibility of empirically assessing the complexity of each model class, that is, the actual difficulty of the estimation problem within each class. The estimated complexities are in turn used to define an adaptive model selection procedure, which is based on complexity penalized empirical risk. The available data are divided into two parts. The first is used to form an empirical cover of each model class, and the second is used to select a candidate rule from each cover based on empirical risk. The covering radii are determined empirically to optimize a tight upper bound on the estimation error.
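To make the idea of complexity-penalized selection over a sequence of model classes concrete, here is a minimal Python sketch. It is a simplification and not the procedure of the paper: it keeps the two-part data split, but replaces the empirical cover of each class with a single fitted candidate, and it uses a fixed hand-chosen penalty instead of empirically estimated complexities. The model classes (histogram regressors with k bins on [0, 1)), the penalty form `c * k * log(n) / n`, and all function names are assumptions made for illustration.

```python
import math
import random


def fit_histogram(xs, ys, k):
    """Least-squares piecewise-constant fit with k equal bins on [0, 1)."""
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to 0.0 (an arbitrary choice for this sketch).
    levels = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: levels[min(int(x * k), k - 1)]


def empirical_risk(f, xs, ys):
    """Mean squared error of rule f on the sample."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(xs, ys, max_k, c=0.05):
    """Pick the class index minimising second-part risk plus an assumed penalty."""
    half = len(xs) // 2
    x1, y1, x2, y2 = xs[:half], ys[:half], xs[half:], ys[half:]
    best_k, best_score = None, float("inf")
    for k in range(1, max_k + 1):
        f = fit_histogram(x1, y1, k)             # candidate built from part one
        penalty = c * k * math.log(half) / half  # assumed form, not the paper's
        score = empirical_risk(f, x2, y2) + penalty  # evaluated on part two
        if score < best_score:
            best_k, best_score = k, score
    return best_k


# Toy data: a two-level step function observed with Gaussian noise.
random.seed(1)
xs = [random.random() for _ in range(400)]
ys = [(1.0 if x >= 0.5 else 0.0) + random.gauss(0.0, 0.1) for x in xs]
chosen_k = select_model(xs, ys, max_k=8)
print(chosen_k)
```

On this toy problem the penalty is what separates the two-bin class from the larger even-sized classes, which fit the step equally well but pay more for their size.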
Mixing Strategies for Density Estimation
- Ann. Statist.
Cited by 28 (9 self)
General results on adaptive density estimation are obtained with respect to any countable collection of estimation strategies under Kullback-Leibler and squared $L_2$ losses. It is shown that without knowing which strategy works best for the underlying density, a single strategy can be constructed by mixing the proposed ones to be adaptive in terms of statistical risks. A consequence is that under some mild conditions, an asymptotically minimax-rate adaptive estimator exists for a given countable collection of density classes, i.e., a single estimator can be constructed to be simultaneously minimax-rate optimal for all the function classes being considered. A demonstration is given for high-dimensional density estimation on $[0, 1]^d$ where the constructed estimator adapts to smoothness and interaction-order over some piecewise Besov classes, and is consistent for all the densities with finite entropy. 1. Introduction. In recent years, there has been an increasing interest in adaptive fu...
Risk bounds for Statistical Learning
Cited by 25 (1 self)
We propose a general theorem providing upper bounds for the risk of an empirical risk minimizer (ERM). We essentially focus on the binary classification framework. We extend Tsybakov’s analysis of the risk of an ERM under margin-type conditions by using concentration inequalities for conveniently weighted empirical processes. This allows us to deal with other ways of measuring the “size” of a class of classifiers than entropy with bracketing as in Tsybakov’s work. In particular we derive new risk bounds for the ERM when the classification rules belong to some VC-class under margin conditions and discuss the optimality of those bounds in a minimax sense.
Combining Different Procedures for Adaptive Regression
- Journal of Multivariate Analysis, 1998
Cited by 24 (7 self)
Given any countable collection of regression procedures (e.g., kernel, spline, wavelet, local polynomial, neural nets, etc.), we show that a single adaptive procedure can be constructed to share the advantages of them to a great extent in terms of global squared $L_2$ risk. The combined procedure basically pays a price only of order $1/n$ for adaptation over the collection. An interesting consequence is that for a countable collection of classes of regression functions (possibly of completely different characteristics), a minimax-rate adaptive estimator can be constructed such that it automatically converges at the right rate for each of the classes being considered.
Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
- In Advances in Neural Information Processing Systems (NIPS), 2007
"... by convex risk minimization ..."
Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting
- 2007
Cited by 20 (1 self)
The problem of sparsity pattern or support set recovery refers to estimating the set of nonzero coefficients of an unknown vector $\beta^* \in \mathbb{R}^p$ based on a set of n noisy observations. It arises in a variety of settings, including subset selection in regression, graphical model selection, signal denoising, compressive sensing, and constructive approximation. The sample complexity of a given method for subset recovery refers to the scaling of the required sample size n as a function of the signal dimension p, sparsity index k (number of nonzeros in $\beta^*$), as well as the minimum value $\beta_{\min}$ of $\beta^*$ over its support and other parameters of the measurement matrix. This paper studies the information-theoretic limits of sparsity recovery: in particular, for a noisy linear observation model based on random measurement matrices drawn from general Gaussian measurement matrices, we derive both a set of sufficient conditions for exact support recovery using an exhaustive search decoder, as well as a set of necessary conditions that any decoder, regardless of its computational complexity, must satisfy for exact support recovery. This analysis of fundamental limits complements our previous work on sharp thresholds for support set recovery over the same set of random measurement ensembles using the polynomial-time Lasso method ($\ell_1$-constrained quadratic programming). Index Terms: Compressed sensing, $\ell_1$-relaxation, Fano’s method, high-dimensional statistical inference, information-theoretic
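The exhaustive search decoder mentioned in this abstract can be illustrated with a small Python sketch. It assumes the decoder is taken to be the size-k subset of columns whose least-squares fit leaves the smallest residual, which is a common reading of the term; the abstract itself does not spell out the details. The function names, the toy dimensions, and the Gram-Schmidt projection used in place of a linear-algebra library are all choices made for this illustration.

```python
import itertools
import random


def residual_norm_sq(columns, y):
    """Squared norm of y after projecting out the span of `columns`."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:  # modified Gram-Schmidt step
            coef = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > 1e-12:  # skip columns that are numerically dependent
            basis.append([vi / norm for vi in v])
    r = list(y)
    for b in basis:
        coef = sum(ri * bi for ri, bi in zip(r, b))
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return sum(ri * ri for ri in r)


def exhaustive_support(X_cols, y, k):
    """Return the size-k column subset with the smallest least-squares residual."""
    p = len(X_cols)
    return min(
        itertools.combinations(range(p), k),
        key=lambda S: residual_norm_sq([X_cols[j] for j in S], y),
    )


# Toy instance: Gaussian measurement matrix stored column-wise, true support {1, 5}.
random.seed(0)
n, p, k = 30, 8, 2
X_cols = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [X_cols[1][i] - X_cols[5][i] + random.gauss(0.0, 0.05) for i in range(n)]
recovered = exhaustive_support(X_cols, y, k)
print(recovered)
```

The search visits all p-choose-k subsets, so it is only feasible for very small problems; that cost is why the abstract contrasts it with the polynomial-time Lasso.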
Model selection via testing: an alternative to (penalized) maximum likelihood estimators
- 2003
Cited by 20 (3 self)
This paper is devoted to the description and study of a family of estimators, which we shall call T-estimators (T for tests), for minimax estimation and model selection. Their construction is based on earlier ideas about deriving estimators from some families of tests due to Le Cam (1973 and 1975) and Birgé (1983, 1984a and b) and about complexity-based model selection from Barron and Cover (1991). It is
Adaptive Regression by Mixing
- Journal of the American Statistical Association
Cited by 19 (6 self)
Adaptation over different procedures is of practical importance. Different procedures perform well under different conditions. In many practical situations, it is rather hard to assess which conditions are (approximately) satisfied so as to identify the best procedure for the data at hand. Thus automatic adaptation over various scenarios is desirable. A practically feasible method, named Adaptive Regression by Mixing (ARM), is proposed to convexly combine general candidate regression procedures. Under mild conditions, the resulting estimator is theoretically shown to perform optimally in rates of convergence without knowing which of the original procedures works best. Simulations are conducted in several settings, including comparing a parametric model with nonparametric alternatives, comparing a neural network with a projection pursuit in multi-dimensional regression, and combining bandwidths in kernel regression. The results clearly support the theoretical property of ARM. The ARM ...
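A minimal Python sketch of the convex-combination idea behind ARM follows. The abstract does not give the weighting rule, so the sketch rests on stated assumptions: a single data split, Gaussian errors with a variance estimated per candidate from the first half, and weights proportional to each candidate's Gaussian likelihood on the second half. The two candidate procedures (`fit_mean`, `fit_line`) and all other names are hypothetical.

```python
import math
import random


def fit_mean(xs, ys):
    """Candidate 1 (hypothetical): predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2 (hypothetical): ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def arm_weights(fitters, xs, ys):
    """Split once, fit on the first half, weight by likelihood on the second."""
    half = len(xs) // 2
    x1, y1, x2, y2 = xs[:half], ys[:half], xs[half:], ys[half:]
    log_w, models = [], []
    for fit in fitters:
        f = fit(x1, y1)
        # Assumed Gaussian errors: variance estimated from training residuals.
        sigma2 = max(sum((y - f(x)) ** 2 for x, y in zip(x1, y1)) / half, 1e-12)
        sse = sum((y - f(x)) ** 2 for x, y in zip(x2, y2))
        # Gaussian log-likelihood on the held-out half, up to a constant.
        log_w.append(-0.5 * len(x2) * math.log(sigma2) - sse / (2.0 * sigma2))
        models.append(f)
    top = max(log_w)  # subtract the maximum for numerical stability
    raw = [math.exp(lw - top) for lw in log_w]
    total = sum(raw)
    return [r / total for r in raw], models


# Toy data from a linear trend, so the line candidate should dominate.
random.seed(0)
xs = [random.random() for _ in range(40)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]
weights, models = arm_weights([fit_mean, fit_line], xs, ys)
combined = lambda x: sum(w * f(x) for w, f in zip(weights, models))
print(weights)
```

The weights are nonnegative and sum to one, so `combined` is a convex combination of the candidates; a fuller version would average the weights over several random splits rather than rely on one.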
Pattern classification and learning theory
Cited by 16 (7 self)
1.1 A binary classification problem. Pattern recognition (or classification or discrimination) is about guessing or predicting the unknown class of an observation. An observation is a collection of numerical measurements, represented by a d-dimensional vector x. The unknown nature of the observation is called a class. It is denoted by y and takes values in the set $\{0, 1\}$. (For simplicity, we restrict our attention to binary classification.) In pattern recognition, one creates a function $g(x): \mathbb{R}^d \to \{0, 1\}$ which represents one's guess of y given x. The mapping g is called a classifier. A classifier errs on x if $g(x) \neq y$. To model the learning problem, we introduce a probabilistic setting, and let $(X, Y)$ be an $\mathbb{R}^d \times \{0, 1\}$-valued random pair. The random pair $(X, Y)$ may be described in a variety of ways: for example, it is defined by the pair $(\mu, \eta)$, where $\mu$ is the probability measure for X and $\eta$ is the regression of Y on X. More precisely, for a Borel-measurable set $A \subseteq \mathbb{R}^d$

