## When do stepwise algorithms meet subset selection criteria? (2007)

Citations: 9 (3 self)

### BibTeX

```bibtex
@MISC{Huo07whendo,
  author = {Xiaoming Huo and Xuelei (Sherry) Ni},
  title  = {When do stepwise algorithms meet subset selection criteria?},
  year   = {2007}
}
```

### Abstract

Recent results in homotopy and solution paths demonstrate that certain well-designed greedy algorithms, with a range of values of the algorithmic parameter, can provide solution paths to a sequence of convex optimization problems. On the other hand, in regression many existing criteria in subset selection (including Cp, AIC, BIC, MDL, RIC, etc.) involve optimizing an objective function that contains a counting measure. The two optimization problems are formulated as (P1) and (P0) in the present paper. The latter is generally combinatorial and has been proven to be NP-hard. We study the conditions under which the two optimization problems have common solutions. Hence, in these situations a stepwise algorithm can be used to solve the seemingly unsolvable problem. Our main result is motivated by recent work in sparse representation, while two others emerge from different angles: a direct analysis of sufficiency and necessity and a condition on the most correlated covariates. An extreme example connected with least angle regression is of independent interest.
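Based on the descriptions quoted in the citation contexts below, the two problems can be written as follows. The symbol Φ for the model matrix and the exact placement of the penalty weights are our notational assumptions, not a transcription of the paper:

```latex
% (P0): subset selection with a counting penalty -- combinatorial, NP-hard
(P_0)\quad \min_{x \in \mathbb{R}^m}\; \|y - \Phi x\|_2^2 + \lambda_0 \|x\|_0,
\qquad \|x\|_0 = \#\{\, i : x_i \neq 0 \,\}

% (P1): convex relaxation -- the counting measure is replaced by the l1 norm
(P_1)\quad \min_{x \in \mathbb{R}^m}\; \|y - \Phi x\|_2^2 + \lambda_1 \|x\|_1,
\qquad \|x\|_1 = \sum_i |x_i|
```

The paper's question is when a solution of the convex problem (P1), obtainable by a stepwise algorithm, shares its support with a solution of the combinatorial problem (P0).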

### Citations

3274 | Convex Analysis - Rockafellar - 1996
Citation Context: ...the entire solution path in a large class of problems; see [24] and the references therein. The homotopy continuation method [39] and the subdifferential are the key technical tools in this development. [42] and [37] are useful references. 2.3. Case studies. We present two cases that have been instructive to us. 2.3.1. An extreme example. We construct an extreme example, in which a sophisticated stepwise ...

2320 | Estimating the dimension of a model - Schwarz - 1978
Citation Context: ...s [denoted by RSS(x)] and where the constant λ0 depends on the criterion. Some well-known results are the Akaike information criterion (AIC) [1], Cp [20, 30], the Bayesian information criterion (BIC) [44], minimum description length (MDL) (see the equivalence between BIC and MDL in [25], Section 7.8), the risk inflation criterion (RIC) [15] and so on. We refer to George [19] for the details. In this p...

2049 | A wavelet tour of signal processing - Mallat - 1998 |

1673 | Atomic decomposition by basis pursuit - Chen, Donoho, et al. - 1998
Citation Context: ...ern appearance of the formulation (P1). The idea of relaxation has been studied extensively in the literature on sparse representation. Some representative papers are (roughly in chronological order) [4, 8, 11, 23, 6, 46, 47, 16, 7, 29, 22], and so on. A full review is well beyond the scope of this paper. The problem of sparse representation has a different emphasis, involving the derivation of a priori conditions instead of a posterior...

1242 | Information theory and an extension of the maximum likelihood principle - Akaike - 1973
Citation Context: ...under (P0), where ‖y − Φx‖₂² is the residual sum of squares [denoted by RSS(x)] and where the constant λ0 depends on the criterion. Some well-known results are the Akaike information criterion (AIC) [1], Cp [20, 30], the Bayesian information criterion (BIC) [44], minimum description length (MDL) (see the equivalence between BIC and MDL in [25], Section 7.8), the risk inflation criterion (RIC) [15] a...
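The context above describes a unified form in which the objective is RSS plus a weight λ0 times the subset size, and only λ0 changes across criteria. A minimal sketch of that unification, assuming Gaussian noise with known variance σ²; the constants below follow one common textbook scaling, and other conventions exist:

```python
import math

def penalty_weight(criterion, n, sigma2, m=None):
    """Weight lambda_0 in the unified objective RSS + lambda_0 * (subset size).

    Constants assume Gaussian noise with known variance sigma2; other
    scaling conventions appear in the literature.
    n = sample size, m = number of candidate covariates (needed for RIC).
    """
    if criterion == "AIC":   # Akaike information criterion
        return 2.0 * sigma2
    if criterion == "BIC":   # Bayesian information criterion (Schwarz)
        return math.log(n) * sigma2
    if criterion == "RIC":   # risk inflation criterion (Foster and George)
        return 2.0 * math.log(m) * sigma2
    raise ValueError(f"unknown criterion: {criterion}")

def score(rss, k, criterion, n, sigma2, m=None):
    """Penalized score of a subset with residual sum of squares rss and size k."""
    return rss + penalty_weight(criterion, n, sigma2, m) * k
```

Comparing two candidate subsets under a given criterion then reduces to comparing their `score` values, which is exactly the (P0) objective with the corresponding λ0.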

1036 | Wrappers for feature subset selection - Kohavi, John - 1997
Citation Context: ...et application because they can identify more cases of equivalence. Subset selection has applications in feature selection. There are two major approaches in feature selection: filter and wrapper; see [27, 28, 32] for details. Our formulations are closely related to wrappers. A recent survey paper by Fan and Li [13] gives an excellent overview of the statistical challenges associated with high-dimensional data,...

750 | Least angle regression - Efron, Hastie, et al. - 2004 |

700 | The elements of statistical learning: Data mining, inference, and prediction - Hastie, Tibshirani, et al. - 2001
Citation Context: ...-known results are the Akaike information criterion (AIC) [1], Cp [20, 30], the Bayesian information criterion (BIC) [44], minimum description length (MDL) (see the equivalence between BIC and MDL in [25], Section 7.8), the risk inflation criterion (RIC) [15] and so on. We refer to George [19] for the details. In this paper, the "subset selection criteria" that appear in the title encompass all of ...

681 | Adapting to unknown smoothness via wavelet shrinkage - Donoho, Johnstone - 1995
Citation Context: ...the solutions differ by a constant λ1/2. A partial list of references for such a result includes [10, 38, 45], and many more. For readers who are familiar with soft-thresholding and hard-thresholding [9], this result should not come as a surprise. The above two examples collectively motivate us to pursue sufficient conditions that guarantee common support in the solutions of (P0) and (P1). 3. Main res...
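For a single coordinate under an orthonormal design, the counting penalty yields hard thresholding and the ℓ1 penalty yields soft thresholding, which is why common support between (P0) and (P1) is plausible. A sketch, assuming the scalar penalized forms ½(z − x)² + λ0·1{x ≠ 0} and ½(z − x)² + λ1|x|; the resulting thresholds √(2λ0) and λ1 depend on this particular scaling:

```python
def hard_threshold(z, lam0):
    """Minimizer of 0.5*(z - x)**2 + lam0 * (x != 0): keep z or kill it.
    Keeping z costs lam0; setting x = 0 costs z**2 / 2."""
    return z if z * z > 2.0 * lam0 else 0.0

def soft_threshold(z, lam1):
    """Minimizer of 0.5*(z - x)**2 + lam1 * abs(x): shrink z toward zero."""
    if z > lam1:
        return z - lam1
    if z < -lam1:
        return z + lam1
    return 0.0
```

The supports of the two solutions coincide exactly when |z| lies on the same side of both thresholds, which is the coordinate-wise analogue of the paper's equivalence question.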

527 | Greed is good: Algorithmic results for sparse approximation - Tropp - 2004
Citation Context: ...ern appearance of the formulation (P1). The idea of relaxation has been studied extensively in the literature on sparse representation. Some representative papers are (roughly in chronological order) [4, 8, 11, 23, 6, 46, 47, 16, 7, 29, 22], and so on. A full review is well beyond the scope of this paper. The problem of sparse representation has a different emphasis, involving the derivation of a priori conditions instead of a posterior...

382 | Model selection and multimodel inference: A practical information-theoretic approach - Burnham, Anderson - 1998
Citation Context: ...(2) the size of the set � is small. 2.1. Subset selection criteria and (P0). There exists an extensive body of literature on the criteria regarding subset selection. Miller [31], Burnham and Anderson [2] and George [19] all give excellent reviews. An interesting fact is that a majority of these criteria can be unified under (P0), where ‖y − Φx‖₂² is the residual sum of squares [denoted by RSS(x)] a...

366 | Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization - Donoho, Elad - 2003
Citation Context: ...ern appearance of the formulation (P1). The idea of relaxation has been studied extensively in the literature on sparse representation. Some representative papers are (roughly in chronological order) [4, 8, 11, 23, 6, 46, 47, 16, 7, 29, 22], and so on. A full review is well beyond the scope of this paper. The problem of sparse representation has a different emphasis, involving the derivation of a priori conditions instead of a posterior...

357 | Uncertainty principles and ideal atomic decomposition - Donoho, Huo - 2001
Citation Context: ...t ‖x‖0 (resp., ‖x‖1) is a quasi-norm (resp., norm) in R^m. In the literature on sparse representation, these are called the ℓ0-norm and the ℓ1-norm, respectively. The notation (P0) and (P1) also appears in [8], with slightly different definitions. Received July 2005; revised July 2006. Supported in part by NSF Grant DMS-01-40587. AMS 2000 subject classification. 62J07. Key words and phrases. Subset selec...

357 | Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition - Pati, Rezaiifar, et al. - 1993 |

329 | Regression shrinkage and selection via the lasso - Tibshirani - 1996
Citation Context: ...of our knowledge, no paper has formally presented a proof of this yet. At the same time, (P1), which has a long history that will be reviewed later, is the mathematical problem that is called upon in [45]. Recent advances (details and references are provided in Section 2.2) demonstrate that some stepwise algorithms (e.g., [10, 38, 39]) reveal the solution paths of problem (P1) while the parameter λ1 t...

318 | Sparse approximate solutions to linear systems - Natarajan - 1995
Citation Context: ...t, unless the model matrix Φ possesses some special structure. In fact, solving (P0) is an NP-hard problem! The following theorem can be considered as an extension of a result originally presented in [33]. The proof of the theorem appears in Appendix A.1. THEOREM 2.1. Solving (P0) with a fixed λ0 is an NP-hard problem. 2.2. Stepwise algorithms and (P1). Due to the difficulty of solving (P0), a relaxat...
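To see why (P0) is combinatorial, here is a brute-force solver for the special case of an orthonormal model matrix, where the RSS of each subset has a closed form; it still enumerates all 2^m subsets, which is exactly the exponential cost the NP-hardness theorem above concerns. The closed-form shortcut is an assumption of this toy setting and does not exist for general Φ:

```python
from itertools import combinations

def best_subset_orthonormal(corr2, lam0):
    """Exhaustive (P0) search when the columns of the model matrix are
    orthonormal.  corr2[j] is the squared inner product of column j with y;
    up to the constant ||y||^2, the objective of a subset S is
        -sum(corr2[j] for j in S) + lam0 * len(S).
    Enumerates all 2**m subsets -- exponentially many in m."""
    m = len(corr2)
    best_obj, best_subset = 0.0, set()   # empty subset as the baseline
    for k in range(m + 1):
        for S in combinations(range(m), k):
            obj = -sum(corr2[j] for j in S) + lam0 * k
            if obj < best_obj:
                best_obj, best_subset = obj, set(S)
    return best_subset
```

In this orthonormal toy the answer collapses to {j : corr2[j] > λ0}, so the search is avoidable; for a general model matrix no such shortcut is known, which is the content of the hardness result.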

314 | Feature selection: Evaluation, application, and small sample performance - Jain, Zongker - 1997
Citation Context: ...et application because they can identify more cases of equivalence. Subset selection has applications in feature selection. There are two major approaches in feature selection: filter and wrapper; see [27, 28, 32] for details. Our formulations are closely related to wrappers. A recent survey paper by Fan and Li [13] gives an excellent overview of the statistical challenges associated with high-dimensional data,...

302 | Just relax: Convex programming methods for identifying sparse signals - Tropp

299 | Stable recovery of sparse overcomplete representations in the presence of noise - Donoho, Elad, et al.

235 | Subset Selection in Regression - Miller - 1990 |

215 | Sparse representations in unions of bases - Gribonval, Nielsen

184 | A Branch and Bound Algorithm for Feature Subset Selection - Narendra, Fukunaga - 1977
Citation Context: ...(i.e., the number of covariates) increases, the methods based on exhaustive search rapidly become impractical. Innovative ideas have been developed to reduce the number of subsets being searched; see [17, 32], as well as some later improvements, [18, 35, 36, 40, 41]. All of these methods adopt a branch-and-bound (B&B) strategy. Improvements can be achieved by modifying the structure in B&B or by applying ...

174 | A generalized uncertainty principle and sparse representation in pairs of bases - Elad, Bruckstein

163 | A new approach to variable selection in least squares problems - Osborne, Presnell, et al. - 2000
Citation Context: ...will be reviewed later, is the mathematical problem that is called upon in [45]. Recent advances (details and references are provided in Section 2.2) demonstrate that some stepwise algorithms (e.g., [10, 38, 39]) reveal the solution paths of problem (P1) while the parameter λ1 takes a range of values. More importantly, most of these algorithms take only a polynomial number of operations (i.e., they are polyn...

158 | On sparse representations in arbitrary redundant bases - Fuchs - 2004

148 | The entire regularization path for the support vector machine - Hastie, Rosset, et al.
Citation Context: ...ues of λ1, based on the idea of homotopy (see [38]). More recent analysis further demonstrates that stepwise algorithms can literally render the entire solution path in a large class of problems; see [24] and the references therein. The homotopy continuation method [39] and the subdifferential are the key technical tools in this development. [42] and [37] are useful references. 2.3. Case studies. We pres...
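The homotopy/LARS results quoted above say that the whole solution path of (P1), as λ1 varies, comes out of a single stepwise run. In the orthonormal special case the path is explicit: each coordinate of the solution is the soft-thresholding of its least-squares coefficient, piecewise linear in λ1. A sketch under that orthonormality assumption:

```python
def lasso_path_orthonormal(z, lambdas):
    """Solution path of (P1) for an orthonormal design: coordinate j of the
    solution at penalty lam is the soft-thresholding of the OLS coefficient
    z[j].  The path is piecewise linear in lam, mirroring the homotopy picture."""
    def soft(v, lam):
        if v > lam:
            return v - lam
        if v < -lam:
            return v + lam
        return 0.0
    return [[soft(v, lam) for v in z] for lam in lambdas]
```

Sweeping `lambdas` from large to small reproduces the order in which covariates enter the model, which is the behavior the stepwise algorithms exploit in the general (non-orthonormal) case.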

145 | On the lasso and its dual - Osborne, Presnell, et al. - 2000
Citation Context: ...will be reviewed later, is the mathematical problem that is called upon in [45]. Recent advances (details and references are provided in Section 2.2) demonstrate that some stepwise algorithms (e.g., [10, 38, 39]) reveal the solution paths of problem (P1) while the parameter λ1 takes a range of values. More importantly, most of these algorithms take only a polynomial number of operations (i.e., they are polyn...

142 | Matching pursuit in a time-frequency dictionary - Mallat, Zhang - 1993 |

127 | The Risk Inflation Criterion for Multiple Regression - Foster, George - 1994
Citation Context: ...C) [1], Cp [20, 30], the Bayesian information criterion (BIC) [44], minimum description length (MDL) (see the equivalence between BIC and MDL in [25], Section 7.8), the risk inflation criterion (RIC) [15] and so on. We refer to George [19] for the details. In this paper, the "subset selection criteria" that appear in the title encompass all of the foregoing criteria. Solving (P0) generally requires...

125 | Some comments on Cp - Mallows - 1973
Citation Context: ...0), where ‖y − Φx‖₂² is the residual sum of squares [denoted by RSS(x)] and where the constant λ0 depends on the criterion. Some well-known results are the Akaike information criterion (AIC) [1], Cp [20, 30], the Bayesian information criterion (BIC) [44], minimum description length (MDL) (see the equivalence between BIC and MDL in [25], Section 7.8), the risk inflation criterion (RIC) [15] and so on. We ...

123 | Basis pursuit - Chen - 1995
Citation Context: ...ates. Another example regarding the performance of LARS can be found in [48], which has a different emphasis. This example is motivated by an early example in [4], which can be traced further back to [3] and [5] in the analysis of some stepwise algorithms (e.g., orthogonal matching pursuit) in signal processing. Our example is similar in spirit; however, it differs in constructional details. 2.3.2...

109 | Regression by leaps and bounds - Furnival, Wilson - 1974
Citation Context: ...(i.e., the number of covariates) increases, the methods based on exhaustive search rapidly become impractical. Innovative ideas have been developed to reduce the number of subsets being searched; see [17, 32], as well as some later improvements, [18, 35, 36, 40, 41]. All of these methods adopt a branch-and-bound (B&B) strategy. Improvements can be achieved by modifying the structure in B&B or by applying ...

94 | Some remarks on greedy algorithms - DeVore, Temlyakov - 1996
Citation Context: ...nother example regarding the performance of LARS can be found in [48], which has a different emphasis. This example is motivated by an early example in [4], which can be traced further back to [3] and [5] in the analysis of some stepwise algorithms (e.g., orthogonal matching pursuit) in signal processing. Our example is similar in spirit; however, it differs in constructional details. 2.3.2. Subse...
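The context above mentions greedy stepwise algorithms such as orthogonal matching pursuit. As an illustration of the greedy idea only, here is plain (non-orthogonal) matching pursuit; the orthogonal variant cited in these entries additionally re-fits over all selected atoms at each step, which this sketch omits. Atoms are assumed unit-norm:

```python
def matching_pursuit(y, atoms, n_iter):
    """Greedy matching pursuit: at each step pick the unit-norm atom most
    correlated with the current residual and subtract that component.
    Returns the accumulated coefficients and the final residual."""
    residual = list(y)
    coef = [0.0] * len(atoms)
    for _ in range(n_iter):
        # correlation of each atom with the residual
        corrs = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(corrs[i]))
        c = corrs[j]
        coef[j] += c
        residual = [r - c * a for r, a in zip(residual, atoms[j])]
    return coef, residual
```

For an orthonormal dictionary this recovers the exact coefficients in as many steps as there are active atoms; the extreme examples discussed in the paper are precisely cases where such greedy selection can be led astray by correlated atoms.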

90 | Least angle regression (with discussion) - Efron, Hastie, et al. - 2004
Citation Context: ...will be reviewed later, is the mathematical problem that is called upon in [45]. Recent advances (details and references are provided in Section 2.2) demonstrate that some stepwise algorithms (e.g., [10, 38, 39]) reveal the solution paths of problem (P1) while the parameter λ1 takes a range of values. More importantly, most of these algorithms take only a polynomial number of operations (i.e., they are polyn...

79 | Nonconcave penalized likelihood with a diverging number of parameters - Fan, Peng - 2004
Citation Context: ...l be: when can one verify that SCAD is indeed solved by a polynomial-time algorithm? That is, we want to derive some sufficient conditions similar to those in the present paper. Note that Fan and Peng [14] give a fundamental description of when oracle properties (as well as other properties) are achievable, while a recent manuscript by Zou [50] proves the oracle property for a method that is rooted in ...

75 | Adaptive time-frequency decomposition - Davis, Mallat, et al. - 1994 |

70 | Approximation of functions over redundant dictionaries using coherence - Gilbert, Muthukrishnan, et al. - 2003 |

54 | Linear inversion of band-limited reflection seismograms - Santosa, Symes - 1986
Citation Context: ...hms and (P1). Due to the difficulty of solving (P0), a relaxation idea has been proposed. The relaxation replaces the ℓ0-norm with the ℓ1-norm in the objective, which leads to (P1). Santosa and Symes [43] is considered the first modern appearance of the formulation (P1). The idea of relaxation has been studied extensively in the literature on sparse representation. Some representative papers are (roug...

49 | On Measuring and Correcting the Effects of Data Mining and Model Selection - Ye - 1998 |

47 | Homotopy continuation for sparse signal representation - Malioutov, Çetin, et al.

45 | The Estimation of Prediction Error: Covariance Penalties and CrossValidation - Efron - 2004 |

39 | The Variable Selection Problem - George - 2000
Citation Context: ...the set � is small. 2.1. Subset selection criteria and (P0). There exists an extensive body of literature on the criteria regarding subset selection. Miller [31], Burnham and Anderson [2] and George [19] all give excellent reviews. An interesting fact is that a majority of these criteria can be unified under (P0), where ‖y − Φx‖₂² is the residual sum of squares [denoted by RSS(x)] and where the con...

34 | Statistical challenges with high dimensionality: feature selection in knowledge discovery - Fan, Li - 2006
Citation Context: ...selection. There are two major approaches in feature selection: filter and wrapper; see [27, 28, 32] for details. Our formulations are closely related to wrappers. A recent survey paper by Fan and Li [13] gives an excellent overview of the statistical challenges associated with high-dimensional data, including feature selection and feature extraction. Besides many contemporary applications, as summariz...

33 | Finite algorithms in optimization and data analysis - Osborne - 1985
Citation Context: ...e solution path in a large class of problems; see [24] and the references therein. The homotopy continuation method [39] and the subdifferential are the key technical tools in this development. [42] and [37] are useful references. 2.3. Case studies. We present two cases that have been instructive to us. 2.3.1. An extreme example. We construct an extreme example, in which a sophisticated stepwise algorith...

33 | Adaptive model selection - Shen, Ye - 2002 |

30 | A simple test to check the optimality of a sparse signal approximation - Gribonval, Ventura, et al. - 2006

29 | On the optimality of the Backward Greedy Algorithm for the subset selection problem - Couvreur, Bresler |

27 | Sparse representations for multiple measurements vectors (mmv) in an overcomplete dictionary - Chen, Huo - 2005 |

17 | Construction of supersaturated designs through partially aliased interactions - Wu - 1993
Citation Context: ...esides many contemporary applications, as summarized in [13], other applications are foreseeable. For example, subset selection is a critical problem in supersaturated design. A citation search of Wu [49] will provide most of the existing literature. A numerically efficient condition on the optimality of subsets has the potential to identify a good design. 6. Conclusion. Stepwise algorithms can be num...

17 | Information theory and the maximum likelihood principle - Akaike - 1973 |