## Algorithms for Subset Selection in Linear Regression (2008)

### Download Links

- [www-rcf.usc.edu]
- [www-rcf.usc.edu]
- [www-bcf.usc.edu]
- [pollux.usc.edu]
- DBLP

### Other Repositories/Bibliography

Venue: STOC'08

Citations: 31 (3 self)

### BibTeX

```bibtex
@MISC{Das08algorithmsfor,
  author = {Abhimanyu Das and David Kempe},
  title = {Algorithms for Subset Selection in Linear Regression},
  year = {2008}
}
```

### Abstract

We study the problem of selecting a subset of k random variables to observe that will yield the best linear prediction of another variable of interest, given the pairwise correlations between the observation variables and the predictor variable. Under approximation-preserving reductions, this problem is also equivalent to the "sparse approximation" problem of approximating signals concisely. We propose and analyze exact and approximation algorithms for several special cases of practical interest. We give an FPTAS when the covariance matrix has constant bandwidth, and exact algorithms when the associated covariance graph, consisting of edges for pairs of variables with nonzero correlation, forms a tree or has a large (known) independent set. Furthermore, we give an exact algorithm when the variables can be embedded into a line such that the covariance decreases exponentially in the distance, and a constant-factor approximation when the variables have no "conditional suppressor variables". Much of our reasoning is based on perturbation results for the R² multiple correlation measure, frequently used as a measure of "goodness of fit". It lies at the core of our FPTAS, and also allows us to extend exact algorithms to approximation algorithms when the matrix "nearly" falls into one of the above classes. We also use perturbation analysis to prove approximation guarantees for the widely used "Forward Regression" heuristic when the observation variables are nearly independent.
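For concreteness, both objectives can be evaluated directly from the covariance data: for a subset S, R²_{Z,S} = b_Sᵀ C_S⁻¹ b_S / Var[Z]. A minimal brute-force sketch of the subset selection problem (function names and data are illustrative, not from the paper; exhaustive search is exponential and shown only to make the objective concrete):

```python
import itertools
import numpy as np

def r_squared(C, b, var_z, S):
    # R^2_{Z,S} = b_S^T C_S^{-1} b_S / Var[Z]; S indexes the chosen variables.
    S = list(S)
    b_S = b[S]
    return float(b_S @ np.linalg.solve(C[np.ix_(S, S)], b_S)) / var_z

def best_subset(C, b, var_z, k):
    # Exhaustive search over all k-subsets -- exponential, illustration only.
    return max(itertools.combinations(range(len(b)), k),
               key=lambda S: r_squared(C, b, var_z, S))
```

When the observation variables are independent (C = I), R²_{Z,S} reduces to the sum of the squared covariances bᵢ², so the best subset is simply the k variables with the largest |bᵢ|.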

### Citations

5102 | Matrix Analysis - Horn, Johnson - 1985

Citation Context ...ndom variables, respectively. The matrix of covariances between Xi and Xj is denoted by C, so ci,j = Cov[Xi, Xj]. The vector b denotes the covariances between Z and the Xi, so bi = Cov[Z, Xi]. Recall [16] that a matrix C is a covariance matrix iff it is positive semi-definite. We use CS to denote the submatrix with row and column set S, and bS to denote the vector with only entries bi for i ∈ S. We de...

4079 | Convex Optimization - Boyd, Vandenberghe - 2004

Citation Context ...,rbr, and c′i,j = ci,j − Σr∈S ci,r cj,r. Because C′ is a covariance matrix, and thus positive semidefinite, the function b′ᵀC′⁻¹b′ is convex in each entry of b′ and C′ (see, e.g., Section 3.1.7 of [2]). The equivalent formulation above allows us to reduce the problem to the Shaped Partition Problem [17, 25], defined as follows. A p-shape of n is a tuple λ = (λ1, λ2, . . . , λp) of nonnegative inte...

2065 | Regression shrinkage and selection via the lasso - Tibshirani - 1996

Citation Context .... However, [7] points out that calculating this threshold might itself be NP-hard. Instead of restricting the number of variables sampled, other widely used regression models such as the Lasso method [31], and the Elastic Net [39], prescribe different constraints on the regression coefficient vectors. For instance, the Lasso method gives an upper bound on the l1 norm of the regression coefficients vec...
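The l1 constraint mentioned in this context is what distinguishes the Lasso from a plain subset-size (l0) constraint. A minimal numpy sketch of the penalized form of the Lasso via proximal gradient descent (ISTA); the function names and data are illustrative, not the cited paper's method:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the l1 norm: shrink toward 0, zeroing small entries.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=500):
    # Minimize (1/2)*||y - X w||^2 + lam*||w||_1 by proximal gradient (ISTA).
    L = np.linalg.norm(X, 2) ** 2        # step size from the spectral norm
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```

For an orthonormal design the solution is exactly soft-thresholding of the least-squares coefficients, which makes the sparsity induced by the l1 penalty easy to see: small coefficients are set to exactly zero.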

1536 | Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd ed. - Cohen, Cohen, et al. - 2003

Citation Context ... been sampled), then we prove that Forward Regression gives a (1 − 1/e) approximation for maximizing R²Z,S (Section 8). This result is important in that it has been observed empirically in the past [5, 36] that much of the difficulty in subset selection results from such "suppressor variables". Our result confirms this empirical observation analytically. 1.2 Related Work With the advent of sensor netwo...

810 | Stable signal recovery from incomplete and inaccurate measurements - Candes, Romberg, et al. - 2005

Citation Context ...und on the performance of a greedy algorithm for this problem. This problem has also been studied extensively in recent years in the context of sparse signal recovery from a set of noisy observations [3, 12, 11, 34, 37]. Almost all of these results use convex relaxation techniques to replace the l0 norm constraint with an l1 norm constraint, and solve the corresponding convex optimization problem. They then prove ce...

774 | Applied Multivariate Statistical Analysis - Johnson, Wichern - 1998

Citation Context ...ector b of covariances between Z and the Xi. The optimization problem can then be phrased as follows: Given C and b, find a set S of size at most k so as to minimize the mean squared prediction error [10, 18] Err(Z, S) := E[(Z − Σi∈S αiXi)²], where the αi are the optimal regression coefficients specifically for the set S. Alternatively, maximize the squared multiple correlation [10] R²Z,S := Var[Z]−...

674 | Matrix Perturbation Theory - Stewart, Sun - 1990

Citation Context ... |bᵀC⁻¹b − bᵀ(C+E)⁻¹b| / |bᵀC⁻¹b| ≤ λ1‖(C + E)⁻¹ − C⁻¹‖2. To bound the right-hand side, we can invoke a well-known theorem from perturbation theory of matrix inverses, Theorem III.2.5 from [28]. Assuming that ‖C⁻¹E‖2 < 1, the theorem, submultiplicativity of matrix norms, and the fact that ‖C⁻¹‖2 = 1/λk, together imply that λ1‖(C + E)⁻¹ − C⁻¹‖2 ≤ λ1 · ‖C⁻¹E‖2/(1 − ‖C⁻¹E‖2) · ‖C⁻¹‖2 = λ1 ‖C⁻...
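The relative-change inequality quoted in this context is easy to verify numerically. A small sketch with arbitrary illustrative data (C is any positive definite matrix, E a small symmetric perturbation; none of these values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)            # a well-conditioned covariance matrix
b = rng.standard_normal(n)
E = 1e-3 * rng.standard_normal((n, n))
E = (E + E.T) / 2                      # small symmetric perturbation

exact = b @ np.linalg.solve(C, b)      # b^T C^{-1} b
perturbed = b @ np.linalg.solve(C + E, b)
lam1 = np.linalg.eigvalsh(C)[-1]       # largest eigenvalue of C
bound = lam1 * np.linalg.norm(np.linalg.inv(C + E) - np.linalg.inv(C), 2)
rel_change = abs(exact - perturbed) / abs(exact)
assert rel_change <= bound             # the quoted perturbation bound holds
```

The bound follows because |bᵀ(C⁻¹ − (C+E)⁻¹)b| ≤ ‖b‖² ‖(C+E)⁻¹ − C⁻¹‖2, while bᵀC⁻¹b ≥ ‖b‖²/λ1.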

564 | Greed is good: algorithmic results for sparse approximation - Tropp

Citation Context ...ssion heuristic (also known as Forward Selection, and defined in Section 3.1), under the assumption that the Xi variables have small correlations. The guarantees for Err(Z, S) are similar to those in [14, 32] for slightly different greedy heuristics, under similar independence assumptions. In terms of the covariance graph G induced by Z and the Xi, the previous result assumes G to be a star (except for ed...

454 | An analysis of approximations for maximizing submodular set functions I - Nemhauser, Wolsey, et al. - 1978

Citation Context ...e idea of the proof is to show that the absence of suppressor variables implies that the objective function R²Z,S is submodular in S. Then, by a well-known result of Nemhauser, Wolsey, and Fisher [6, 24], the greedy algorithm for maximization is a (1 − 1/e) approximation. We note here that an alternate NP-hardness proof (omitted here due to lack of space) shows that the subset selection problem remai...
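The greedy algorithm this context refers to is Forward Regression: at each step, add the variable producing the largest gain in R²Z,S. A minimal sketch (names and data are illustrative; the (1 − 1/e) guarantee requires the no-suppressor-variable condition discussed in the paper):

```python
import numpy as np

def forward_regression(C, b, var_z, k):
    # Greedily grow S by the variable with the largest marginal R^2 gain.
    def r2(T):
        T = list(T)
        return float(b[T] @ np.linalg.solve(C[np.ix_(T, T)], b[T])) / var_z
    S, remaining = [], set(range(len(b)))
    for _ in range(k):
        best = max(remaining, key=lambda i: r2(S + [i]))
        S.append(best)
        remaining.remove(best)
    return S
```

With independent observation variables the marginal gains are just the bᵢ², so the greedy choice coincides with the optimal subset; suppressor variables are exactly what breaks this alignment.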

415 | Regularization and variable selection via the elastic net - Zou, Hastie - 2005

Citation Context ...that calculating this threshold might itself be NP-hard. Instead of restricting the number of variables sampled, other widely used regression models such as the Lasso method [31], and the Elastic Net [39], prescribe different constraints on the regression coefficient vectors. For instance, the Lasso method gives an upper bound on the l1 norm of the regression coefficients vector (whereas the set size ...

404 | Data streams: Algorithms and applications - Muthukrishnan - 2003

Citation Context ... The other direction defines the dictionary through the Cholesky Decomposition of the covariance matrix C. Details are deferred to the full version of this paper due to space constraints. As shown in [8, 23, 22] (where the results are phrased equivalently in terms of sparse approximation), the subset selection problem is NP-complete, and the minimization version can in general not be approximated to within a...

365 | Model-driven Data Acquisition in Sensor Networks - Deshpande, Guestrin, et al. - 2004

Citation Context ...cally. 1.2 Related Work With the advent of sensor networks, much recent work has focused on tradeoffs between the accuracy in measurements and the energy expended in retrieving data. Deshpande et al. [9] use statistical models of sensor data to extrapolate sensor readings based on already collected sensor data, and reduce the number of sensors needed to answer a given query. Guestrin et al. [15] and ...

365 | For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution - Donoho - 2006

Citation Context ...und on the performance of a greedy algorithm for this problem. This problem has also been studied extensively in recent years in the context of sparse signal recovery from a set of noisy observations [3, 12, 11, 34, 37]. Almost all of these results use convex relaxation techniques to replace the l0 norm constraint with an l1 norm constraint, and solve the corresponding convex optimization problem. They then prove ce...

336 | Sparse approximate solutions to linear systems - Natarajan - 1995

Citation Context ...a desired estimation error ε in predicting the input vector. Assuming an n × m dictionary φ, the aim is to find a coefficient vector α0 ∈ Rᵐ that minimizes ‖α0‖0 subject to ‖y − φα0‖2 ≤ ε. Natarajan [23] proved a weak approximation bound on the performance of a greedy algorithm for this problem. This problem has also been studied extensively in recent years in the context of sparse signal recovery fr...
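The greedy approach analyzed in this line of work is in the spirit of (Orthogonal) Matching Pursuit. A minimal sketch of OMP for the formulation above, min ‖α‖0 subject to ‖y − φα‖2 ≤ ε (all data illustrative; not the exact algorithm of any one cited paper):

```python
import numpy as np

def omp(phi, y, eps):
    # Repeatedly pick the column most correlated with the residual,
    # then re-fit least squares on the selected columns.
    m = phi.shape[1]
    S, coef, residual = [], np.zeros(0), y.astype(float)
    while np.linalg.norm(residual) > eps and len(S) < m:
        S.append(int(np.argmax(np.abs(phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(phi[:, S], y, rcond=None)
        residual = y - phi[:, S] @ coef
    alpha = np.zeros(m)
    alpha[S] = coef
    return alpha
```

Because each least-squares re-fit makes the residual orthogonal to every selected column, OMP never re-selects a column, which is what distinguishes it from plain Matching Pursuit.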

326 | Just relax: convex programming methods for identifying sparse signals in noise - Tropp - 2006

Citation Context ...und on the performance of a greedy algorithm for this problem. This problem has also been studied extensively in recent years in the context of sparse signal recovery from a set of noisy observations [3, 12, 11, 34, 37]. Almost all of these results use convex relaxation techniques to replace the l0 norm constraint with an l1 norm constraint, and solve the corresponding convex optimization problem. They then prove ce...

321 | Stable recovery of sparse overcomplete representations in the presence of noise - Donoho, Elad, et al.

270 | Subset selection in regression - Miller - 1990

Citation Context ... (much smaller) subset of k variables Xi to predict Z in the future. This problem of selecting the k-subset of variables that "best" predicts Z is known as the subset selection problem for regression [21]. Natural applications of this problem abound. In medical or social studies, one often wants to predict risks or future behaviors (heart disease, failure in school, . . .) in terms of observable quant...

195 | Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies - Krause, Singh, et al. - 2008

76 | Approximation of functions over redundant dictionaries using coherence - Gilbert, Muthukrishnan, et al. - 2003

Citation Context ...ssion heuristic (also known as Forward Selection, and defined in Section 3.1), under the assumption that the Xi variables have small correlations. The guarantees for Err(Z, S) are similar to those in [14, 32] for slightly different greedy heuristics, under similar independence assumptions. In terms of the covariance graph G induced by Z and the Xi, the previous result assumes G to be a star (except for ed...

64 | thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso) - Wainwright

61 | Greedy adaptive approximation - Davis, Mallat, et al. - 1997

Citation Context ...ion and hardness results. 1.1 Our Results The subset-selection problem is NP-hard in general. In fact, it is even NP-hard to decide whether any k-subset reduces the mean squared prediction error to 0 [8]. As a result, no multiplicative approximation guarantee for the Err(Z, S) objective is possible in general. Due to the highly non-linear nature of either objective function, a general approximation f...

56 | An exact algorithm for maximum entropy sampling - Ko, Lee, et al. - 1995

Citation Context ...dels of sensor data to extrapolate sensor readings based on already collected sensor data, and reduce the number of sensors needed to answer a given query. Guestrin et al. [15] and Anstreicher et al. [1, 19] study related problems that deal with maximizing the entropy or mutual information of a subset: the aim is to find the most informative k-subset in a set of sensors, measured in terms of the joint en...

53 | Nonlinear methods of approximation - Temlyakov - 2003

Citation Context ...y sparse, then for all dictionary matrices satisfying an isometry condition, the α1 vector obeys ‖α1 − α0‖2 ≤ cε, for a given constant c. In the mathematical approximation theory community, Temlyakov [29, 30] analyzed convergence theorems to prove bounds on the power decay of approximation using sophisticated mathematical techniques in Hilbert and Banach spaces. 2. PRELIMINARIES The goal is to estimate a ...

35 | Topics in sparse approximation - Tropp - 2004

Citation Context ...ingle stage of greedy Orthogonal Matching Pursuit. Since the bounds of [14, 35, 32] are similar to our bounds for Forward Regression, we defer a precise statement and comparison to Section 3.1. Tropp [33] also presents a detailed study of variants of sparse approximation, and provides interesting results regarding the performance of greedy and convex relaxation methods when the dictionary is almost or...

32 | On the optimality of the backward greedy algorithm for the subset selection problem - Couvreur, Bresler

Citation Context ...ds are Forward, Backward and Stepwise Regression, and Branch and Bound techniques. While there has not been any rigorous algorithmic analysis of Forward and Stepwise Regression, Couvreur and Bresler [7] analyzed Backward Regression and showed that if the optimal prediction error is smaller than a certain threshold, Backward Regression will select the optimal subset. However, [7] points out that calc...

27 | Dynamic programming algorithms for recognizing small bandwidth graphs in polynomial time - Saxe - 1980

Citation Context ...correlations between variables are significant only within a small time interval. Additionally, it can model spatial correlations between sensor variables placed on a line. Using an algorithm of Saxe [27], an ordering with bandwidth β can be found in polynomial time O(n^β) if it exists. We assume from now on that the variables are ordered accordingly. Theorem 4.1. If G̃(C) has bounded bandwidth β, t...

24 | Improved sparse approximation over quasi-incoherent dictionaries - Tropp, Gilbert, et al. - 2003

Citation Context ...o-stage algorithm consisting of two slightly different greedy heuristics: "Matching Pursuit" and "Orthogonal Matching Pursuit". The approximation guarantees were subsequently improved by Tropp et al. [35, 32], the latter paper using just a single stage of greedy Orthogonal Matching Pursuit. Since the bounds of [14, 35, 32] are similar to our bounds for Forward Regression, we defer a precise statement and ...

21 | A polynomial time algorithm for shaped partition problems - Hwang, Onn, et al. - 1999

Citation Context ..., the function b′ᵀC′⁻¹b′ is convex in each entry of b′ and C′ (see, e.g., Section 3.1.7 of [2]). The equivalent formulation above allows us to reduce the problem to the Shaped Partition Problem [17, 25], defined as follows. A p-shape of n is a tuple λ = (λ1, λ2, . . . , λp) of nonnegative integers such that λ1 + · · · + λp = n. Given a set Λ of p-shapes, a p-partition π of {1, . . . , n} into p disjoint s...

18 | Greedy algorithms and m-term approximation with regard to redundant dictionaries - Temlyakov - 1999

Citation Context ...y sparse, then for all dictionary matrices satisfying an isometry condition, the α1 vector obeys ‖α1 − α0‖2 ≤ cε, for a given constant c. In the mathematical approximation theory community, Temlyakov [29, 30] analyzed convergence theorems to prove bounds on the power decay of approximation using sophisticated mathematical techniques in Hilbert and Banach spaces. 2. PRELIMINARIES The goal is to estimate a ...

17 | Location of bank accounts to optimize float - Cornuejols, Fisher, et al. - 1977

Citation Context ...e idea of the proof is to show that the absence of suppressor variables implies that the objective function R²Z,S is submodular in S. Then, by a well-known result of Nemhauser, Wolsey, and Fisher [6, 24], the greedy algorithm for maximization is a (1 − 1/e) approximation. We note here that an alternate NP-hardness proof (omitted here due to lack of space) shows that the subset selection problem remai...

17 | A generalised R̄² criterion for regression models estimated by the instrumental variables method, Econometrica - Pesaran, Smith - 1994

Citation Context ...ue to S. It can intuitively be viewed as the proportion of variance of the predictor variable that can be explained by the set of observation variables, and is often used in the statistics literature [4, 13, 26] to measure the quality of prediction during regression analysis. While other functions are also frequently used to measure the accuracy of regression (such as conditional entropy [1], mean absolute e...

9 | Maximum-entropy remote sampling - Anstreicher, Fampa, et al.

Citation Context ...terature [4, 13, 26] to measure the quality of prediction during regression analysis. While other functions are also frequently used to measure the accuracy of regression (such as conditional entropy [1], mean absolute error, etc.), we only focus on the above two objectives in this paper. Naturally, the optimization problems for both are equivalent at optimality, but the two functions differ signific...

7 | Frequency of selecting noise variables in subset regression analyses: A simulation study, The American Statistician 41 - Flack, Chang - 1987

Citation Context ...ue to S. It can intuitively be viewed as the proportion of variance of the predictor variable that can be explained by the set of observation variables, and is often used in the statistics literature [4, 13, 26] to measure the quality of prediction during regression analysis. While other functions are also frequently used to measure the accuracy of regression (such as conditional entropy [1], mean absolute e...

6 | Leveraging redundancy in sampling-interpolation applications for sensor networks - Liaskovitis, Schurgers - 2007

Citation Context ...or mutual information of the subset of variables. Recently, Liaskovitis and Schurgers suggested a formulation essentially equivalent to the subset selection problem for choosing sensor sets to sample [20]. The formulation of the subset selection problem presented in this paper coincides with that in the statistics community [21]. Many heuristics have been proposed; the book by Miller [21] contains an ...

5 | Suppressor variables and the semipartial correlation coefficient - Velicer - 1978

Citation Context ... been sampled), then we prove that Forward Regression gives a (1 − 1/e) approximation for maximizing R²Z,S (Section 8). This result is important in that it has been observed empirically in the past [5, 36] that much of the difficulty in subset selection results from such "suppressor variables". Our result confirms this empirical observation analytically. 1.2 Related Work With the advent of sensor netwo...

4 | Some effects of errors of measurement on multiple correlation - Cochran - 1970

Citation Context ...ue to S. It can intuitively be viewed as the proportion of variance of the predictor variable that can be explained by the set of observation variables, and is often used in the statistics literature [4, 13, 26] to measure the quality of prediction during regression analysis. While other functions are also frequently used to measure the accuracy of regression (such as conditional entropy [1], mean absolute e...

3 | An exact algorithm for maximum entropy sampling - Anstreicher, Fampa, et al. - 1995

3 | Statistics for the Social and Behavioral Sciences - Diekhoff - 2002

Citation Context ...ector b of covariances between Z and the Xi. The optimization problem can then be phrased as follows: Given C and b, find a set S of size at most k so as to minimize the mean squared prediction error [10, 18] Err(Z, S) := E[(Z − Σi∈S αiXi)²], where the αi are the optimal regression coefficients specifically for the set S. Alternatively, maximize the squared multiple correlation [10] R²Z,S := Var[Z]−...

1 | The vector partition problem for convex optimization functions - Onn, Schulman

Citation Context ..., the function b′ᵀC′⁻¹b′ is convex in each entry of b′ and C′ (see, e.g., Section 3.1.7 of [2]). The equivalent formulation above allows us to reduce the problem to the Shaped Partition Problem [17, 25], defined as follows. A p-shape of n is a tuple λ = (λ1, λ2, . . . , λp) of nonnegative integers such that λ1 + · · · + λp = n. Given a set Λ of p-shapes, a p-partition π of {1, . . . , n} into p disjoint s...

1 | Suppressor variable(s) importance within a regression model - Walker

Citation Context ...mation for maximizing the error reduction R²Z,S. In the statistics community, so-called suppressor variables [5, 36] have frequently been considered as "unfavorable attributes of regression models" [38]. Intuitively, a variable Xj is a suppressor variable if it "suppresses" the correlation between some other Xi and the predictor variable Z, in the sense that Xi appears not (or only slightly) correla...