## Efficient Learning and Feature Selection in High-Dimensional Regression (2010)

### BibTeX

@MISC{Ting10efficientlearning,
    author = {Jo-Anne Ting and Aaron D'Souza and Sethu Vijayakumar and Stefan Schaal},
    title = {Efficient Learning and Feature Selection in High-Dimensional Regression},
    year = {2010}
}

### Abstract

We present a novel algorithm for efficient learning and feature selection in high-dimensional regression problems. We arrive at this model through a modification of the standard regression model, enabling us to derive a probabilistic version of the well-known statistical regression technique of backfitting. Using the expectation-maximization algorithm, along with variational approximation methods to overcome intractability, we extend our algorithm to include automatic relevance detection of the input features. This variational Bayesian least squares (VBLS) approach retains its simplicity as a linear model, but offers a novel, statistically robust black-box approach to generalized linear regression with high-dimensional inputs. It can be easily extended to nonlinear regression and classification problems. In particular, we derive the framework of sparse Bayesian learning, the relevance vector machine, with VBLS at its core, offering significant computational and robustness advantages for this class of methods. The iterative nature of VBLS makes it well suited to real-time incremental learning, which is especially crucial in application domains such as robotics, brain-machine interfaces, and neural prosthetics, where real-time learning of models for control is needed. We evaluate our algorithm on synthetic and neurophysiological data sets, as well as on standard regression and classification benchmark data sets, comparing it with other competitive statistical approaches and demonstrating its suitability as a drop-in replacement for other generalized linear regression techniques.
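For orientation, the classical backfitting procedure that the abstract builds on can be sketched as follows. This is a minimal illustration of ordinary (non-probabilistic) backfitting for a linear model, not the VBLS algorithm itself: it has no EM updates, no variational approximation, and no relevance detection. The function name `backfit_linear`, the stopping rule, and the toy data are assumptions made for this example.

```python
import random

def backfit_linear(X, y, n_sweeps=500, tol=1e-12):
    """Classical backfitting for a linear model y ~ sum_m b[m] * X[i][m].

    X is a list of rows, y a list of targets; no intercept term is fitted.
    """
    n, d = len(X), len(X[0])
    b = [0.0] * d
    # Squared norm of each input column, x_m^T x_m.
    col_sq = [sum(X[i][m] ** 2 for i in range(n)) for m in range(d)]
    # Full residual y - X b, kept up to date as coefficients change.
    residual = list(y)
    for _ in range(n_sweeps):
        max_change = 0.0
        for m in range(d):
            if col_sq[m] == 0.0:
                continue  # an all-zero column carries no information
            # Univariate regression of the partial residual (the residual
            # with dimension m's own contribution added back) on column m.
            num = sum(X[i][m] * (residual[i] + X[i][m] * b[m]) for i in range(n))
            new_bm = num / col_sq[m]
            delta = new_bm - b[m]
            for i in range(n):
                residual[i] -= X[i][m] * delta
            b[m] = new_bm
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break  # coefficients stopped moving
    return b

# Noise-free toy data (made up for illustration): two of five inputs are irrelevant.
random.seed(0)
true_b = [1.5, 0.0, -2.0, 0.0, 0.5]
X = [[random.gauss(0.0, 1.0) for _ in true_b] for _ in range(200)]
y = [sum(w * x for w, x in zip(true_b, row)) for row in X]
b_hat = backfit_linear(X, y)
```

Each sweep only performs univariate regressions, so no matrix inversion is needed; the cost per sweep grows linearly with the number of input dimensions, which is the property that makes backfitting attractive for high-dimensional inputs.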

### Citations

2171 | Support vector networks
- Cortes, Vapnik
- 1995
Citation Context ... first discuss the Relevance Vector Machine (RVM), a sparse Bayesian learning algorithm that operates in a framework similar to generalized linear regression. The Support Vector Machine (SVM), e.g., (Cortes & Vapnik, 1995), is a common and popular method for classification problems, but it can be extended to regression, resulting in Support ... |

1832 | Regression shrinkage and selection via the lasso
- Tibshirani
- 1996
Citation Context ...l issues with stepwise regression. These include its inability to cope with redundant dimensions (it deteriorates in the presence of collinearity) and its inability to shrink regression coefficients (Tibshirani, 1996), resulting in too-large regression coefficients. These properties, among others, make it problematic for high-dimensional data sets. 2.1.6 Partial Least Squares Regression Instead of seeking a low-d... |

1320 | Generalized Additive Models - Hastie, Tibshirani - 1990 |

1240 | Bayesian Data Analysis
- Gelman, Carlin, et al.
- 1995
Citation Context ... an uninformative distribution over α must be uniform over a log scale, corresponding to a Jeffreys prior (Jeffreys, 1946; Gelman, Carlin, Stern, & Rubin, 2000). We fulfill the requirement of an uninformative distribution by choosing the Gamma distribution parameters, a_α and b_α, appropriately (i.e., a_α, b_α → 0). Secondly, the Gamma distribution is analyticall... |

1113 | Pattern recognition and neural networks
- Ripley
- 1996
Citation Context ... VBLS-RVM on exactly the same data used in Tipping (2001), along with an additional real-world large-scale data set. The data sets used for comparison include the following: • Ripley’s synthetic data (Ripley, 1996) • the Banana data set (Rätsch, Onoda, & Müller, 2001) • the Pima Diabetes data set [Footnote 11: The Netflix Prize competition offers a grand prize for an RMSE achieved that is ≤ 0.8563] ... |

752 | Least angle regression
- Efron, Hastie, et al.
Citation Context ... suitable for high-dimensional data sets, at the expense of an open parameter that needs to be set using cross-validation or through the optimization of a regularization “path” of solutions (footnote 2), e.g., (Efron, Hastie, Johnstone, & Tibshirani, 2004). 2.2 Data Structures for Fast Statistics Significant computational gains can be achieved by using smarter data structures to organize the information required for statistical analysis. Examples of t... |

732 | Gradient-based learning applied to document recognition
- LeCun, Bottou, et al.
- 1998
Citation Context ...test sets of 500 samples. [Footnote 12: The MNIST data set is publicly available from http://yann.lecun.com/exdb/mnist. It is a popular benchmark data set that has been analyzed in various forms by many (e.g., (LeCun, Bottou, Bengio, & Haffner, 1998; Keerthi, Chapelle, & DeCoste, 2006)).] 6.4.2 Methods We compared VBLS-RVM to a RVM classifier, the SVM and logistic regression. As mentioned previously, the li... |

658 | The Cascade-Correlation Learning Architecture
- Fahlman, Lebiere
- 1991
Citation Context ...tes the data matrix with the mth dimension removed and $b_{\bar{m}}$ denotes the regression coefficient vector with the mth coefficient removed. The well-known cascade-correlation neural network architecture (Fahlman & Lebiere, 1989) can also be seen to have similar algorithmic underpinnings; the addition of each new hidden unit can be considered to be the tuning of an additional basis function in the sequence, with the previous... |

586 | An algorithm for finding best matches in logarithmic expected time
- Friedman, Bentley, et al.
- 1977
Citation Context ...Significant computational gains can be achieved by using smarter data structures to organize the information required for statistical analysis. Examples of these include KD-trees and balltrees (J. H. Friedman, Bentley, & Finkel, 1977; Gray & Moore, 2001; Omohundro, 1990), which allow caching of sufficient statistics over recursively smaller regions of the data space, and AD-trees (Moore & Lee, 1998; Komarek & Moore, 2000) which s... |

552 | Sparse Bayesian learning and the relevance vector machine. The
- Tipping
- 2001
Citation Context ...ntractable). Nevertheless, we can obtain successful approximate solutions (albeit iteratively) by using the Laplace method (Tipping, 2001) or factorial variational approximations (Bishop & Tipping, 2000). Both these approximations require hyperparameter updates for α that need re-estimation of the posterior covariance and mean of b as: ... |

397 | Mixtures of probabilistic principal component analysers - Tipping, M - 1999 |

254 | Soft margins for AdaBoost
- Rätsch, Onoda, et al.
- 2001
Citation Context ...ed in Tipping (2001), along with an additional real-world large-scale data set. The data sets used for comparison include the following: • Ripley’s synthetic data (Ripley, 1996) • the Banana data set (Rätsch, Onoda, & Müller, 2001) • the Pima Diabetes data set [Footnote 11: The Netflix Prize competition offers a grand prize for an RMSE achieved that is ≤ 0.8563] • ... |

239 | Statistical Field Theory
- Parisi
- 1988
Citation Context ...itive term), the extraction of marginal probabilities of interest such as Q(b) and Q(α) is analytically intractable. Therefore, we use a factorial variational approximation (Ghahramani & Beal, 2000b; Parisi, 1988; Rustagi, 1976) to the true posterior, in which we assume that the posterior distribution factorizes (footnote 3) over the variables of interest, i.e., we restrict ourselves to a family of distributions of the form Q(Z,b,α) = ... |

236 | Regression diagnostics: identifying influential data and sources of collinearity
- Belsley, Kuh, et al.
- 1980
Citation Context ...ecomes increasingly computationally expensive (approximately $O(d^3)$) and numerically brittle. While one can attempt to reduce the complexity down to $O(d^2)$ with efficient matrix inversion techniques (Belsley, Kuh, & Welsch, 1980), solutions to this problem typically fall into one of two categories: 1. Dimensionality reduction for regression: Those tha... |

232 | A Statistical View of Some Chemometrics Regression Tools (with discussion). Technometrics
- Frank, Friedman
- 1993
Citation Context ...o expensive matrix inversion or eigendecomposition and, thus, is well suited to the high-dimensional, yet severely underconstrained data sets in applications such as near infrared (NIR) spectrometry (Frank & Friedman, 1993). The number of projection directions found by PLS is only bound by the dimensionality of the data, with each univariate regression on successive projection components further serving to reduce the r... |

225 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context ... $z = [x^T\, y]^T$ of the data, we can take the output into consideration when determining the appropriate lower-dimensional manifold. 2.1.2 Joint-space Factor Analysis for Regression Factor analysis (Everitt, 1984; Ghahramani & Hinton, 1997) is a density estimation technique which assumes that the observed data z is generated from a lower dimensional process characterized by K latent or hidden variables v as follows: $z_i = W v_i + \epsilon_i$ where... |

222 | Gaussian processes for regression
- Williams, Rasmussen
- 1996
Citation Context ...ompare the generalization performance of VBLS-RVM on a sinc function approximation problem to other competitive nonlinear regression techniques such as the RVM, SVR, Gaussian Process (GP) regression (Williams & Rasmussen, 1996) and Locally Weighted Projection Regression (LWPR) (Vijayakumar & Schaal, 2000). Note that Tipping proposes an optimization of the distance metric λ, based on gradient ascent in the log likelihood (T... |

215 | The Relevance Vector Machine
- Tipping
- 2000
Citation Context ...oximate solutions (albeit iteratively) by using the Laplace method (Tipping, 2001) or factorial variational approximations (Bishop & Tipping, 2000). Both these approximations require hyperparameter updates for α that need re-estimation of the posterior covariance and mean of b as: $\Sigma_b = \left( \mathrm{diag}(\langle \alpha \rangle) + \left\langle \frac{1}{\psi_y} \right\rangle \sum_{i=1}^{N} k_i k_i^T \right)^{-1}$, $\mu_b = \frac{1}{\psi_y} \Sigma_b$ ... |

194 | An invariant form for the prior probability in estimation problems
- Jeffreys
- 1946
Citation Context ...scale parameter, an uninformative distribution over α must be uniform over a log scale, corresponding to a Jeffreys prior (Jeffreys, 1946; Gelman, Carlin, Stern, & Rubin, 2000). We fulfill the requirement of an uninformative distribution by choosing the Gamma distribution parameters, a_α and b_α, appropriately (i.e., a_α, b_α → 0). Secondly,... |

148 | Variational inference for Bayesian mixtures of factor analysers - Ghahramani, Beal |

136 | Feature selection, L1 vs. L2 regularization, and rotational invariance - Ng |

132 | An introduction to latent variable models
- Everitt
- 1984
Citation Context ...space $z = [x^T\, y]^T$ of the data, we can take the output into consideration when determining the appropriate lower-dimensional manifold. 2.1.2 Joint-space Factor Analysis for Regression Factor analysis (Everitt, 1984; Ghahramani & Hinton, 1997) is a density estimation technique which assumes that the observed data z is generated from a lower dimensional process characterized by K latent or hidden variables v as f... |

120 | Cached sufficient statistics for efficient machine learning with large datasets
- Moore, Lee
- 1998
Citation Context ...alltrees (J. H. Friedman, Bentley, & Finkel, 1977; Gray & Moore, 2001; Omohundro, 1990), which allow caching of sufficient statistics over recursively smaller regions of the data space, and AD-trees (Moore & Lee, 1998; Komarek & Moore, 2000) which speed up computations [Footnote 2: That is, solutions that minimize the L1 loss function. When the value of the open/tuning parameter changes, regularization “paths” of solutions a...] |

113 | Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Krishnapuram, Carin, et al.
- 2005
Citation Context ...Faul (2003), iii) VBLS-RVM and iv) the original RVM of Tipping (2001). Other competitive classifiers, aside from the primal SVM (Keerthi et al., 2006), include Sparse Multinomial Logistic Regression (Krishnapuram, Carin, Figueiredo, & Hartemink, 2005) and the doubly regularized SVM (Wang, Zhu, & Zou, 2006), to list a few. Note that the original RVM does not scale well to large-scale data sets due to its $O(N^3)$ computational complexity (per EM it... |

105 | Bayesian parameter estimation via variational methods, Statistics and Computing 10 - Jaakkola, Jordan - 1999 |

94 | Parameter estimation of superimposed signals using the EM algorithm
- Feder, Weinstein
- 1988
Citation Context ... we obtain our EM update in Eq. (20) exactly. Indeed, this is a probabilistic version of backfitting. A similar EM algorithm and model structure has been proposed in the context of signal processing (Feder & Weinstein, 1988), but we believe this is the first time that the connection of this probabilistic derivation to the backfitting algorithm has been demonstrated. As we show in Sec. 4, this allows us to place this cla... |

90 | N-body problems in statistical learning
- Gray, Moore
- 2001
Citation Context ... be achieved by using smarter data structures to organize the information required for statistical analysis. Examples of these include KD-trees and balltrees (J. H. Friedman, Bentley, & Finkel, 1977; Gray & Moore, 2001; Omohundro, 1990), which allow caching of sufficient statistics over recursively smaller regions of the data space, and AD-trees (Moore & Lee, 1998; Komarek & Moore, 2000) which speed up computations... |

75 | The anchors hierarchy: Using the triangle inequality to survive high dimensional data
- Moore
- 2000
Citation Context ...tley, & Finkel, 1977; Gray & Moore, 2001; Omohundro, 1990), which allow caching of sufficient statistics over recursively smaller regions of the data space, and AD-trees (Moore & Lee, 1998; Komarek & Moore, 2000) which speed up computations [Footnote 2: That is, solutions that minimize the L1 loss function. When the value of the open/tuning parameter changes, regularization “paths” of solutions are generated.] ... |

70 | Muscle and movement representations in the primary motor cortex - Kakei, Hoffman, et al. - 1999 |

68 | Comparison of Approximate Methods for Handling Hyperparameters
- MacKay
- 1999
Citation Context ... treat these variables as hyperparameters and place prior distributions over them. Since exact solutions are typically intractable, we can optimize them either by using maximum a posteriori estimates (MacKay, 1999) or by Monte Carlo techniques (Williams & Rasmussen, 1996). Note that there are several optimizations suggested by Tipping ... |

68 | Pattern Recognition and Neural Networks (Cambridge - Ripley - 1996 |

67 | Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach - Wold - 1975 |

65 | Fast marginal likelihood maximisation for sparse Bayesian models
- Tipping, Faul
- 2003
Citation Context ...nger computational times than the fast RVM and SVM. The VBLS-RVM could be modified to accommodate large-scale data sets by greedily adding basis vectors to the design matrix (similar to that done in (Tipping & Faul, 2003)). On average, the fast RVM of Tipping and Faul (2003) performs faster than the RVM and the VBLS-RVM, which is unsurprising, given the modified RVM adds basis vectors in a greedy fashion, potentially ... |

60 | A Simple and Efficient Algorithm for Gene Selection using Sparse Logistic Regression
- Shevade, Keerthi
- 2003
Citation Context ...al quadratic programming or interior-point methods, e.g., (Kim, Koh, Lustig, Boyd, & Gorinevsky, 2007), coordinate descent methods (J. Friedman, Hastie, & Tibshirani, 2007), the Gauss-Seidel method (Shevade & Keerthi, 2003), generalized iterative scaling (Goodman, 2004), and iterative re-weighted least squares (Lokhorst, 1999; Lee, Lee, Abbeel, & Ng, 2006). ... |

58 | Building Support Vector Machines with Reduced Classifier Complexity
- Keerthi, Chapelle, et al.
Citation Context ...ata set is publicly available from http://yann.lecun.com/exdb/mnist. It is a popular benchmark data set that has been analyzed in various forms by many (e.g., (LeCun, Bottou, Bengio, & Haffner, 1998; Keerthi, Chapelle, & DeCoste, 2006)). 6.4.2 Methods We compared VBLS-RVM to a RVM classifier, the SVM and logistic regression. As mentioned previously, the li... |

57 | Exponential priors for maximum entropy models
- Goodman
- 2004
Citation Context ..., (Kim, Koh, Lustig, Boyd, & Gorinevsky, 2007), coordinate descent methods (J. Friedman, Hastie, & Tibshirani, 2007), the Gauss-Seidel method (Shevade & Keerthi, 2003), generalized iterative scaling (Goodman, 2004), and iterative re-weighted least squares (Lokhorst, 1999; Lee, Lee, Abbeel, & Ng, 2006). The LASSO estimate $\hat{b}_{\text{lasso}}$ is th... |

46 | Bumptrees for efficient function, constraint, and classification learning
- Omohundro
- 1991
Citation Context ...g smarter data structures to organize the information required for statistical analysis. Examples of these include KD-trees and balltrees (J. H. Friedman, Bentley, & Finkel, 1977; Gray & Moore, 2001; Omohundro, 1990), which allow caching of sufficient statistics over recursively smaller regions of the data space, and AD-trees (Moore & Lee, 1998; Komarek & Moore, 2000) which speed up computations [Footnote 2: That is, solut...] |

39 | A method for large-scale l1-regularized least squares
- Kim, Koh, et al.
- 2007
Citation Context ...ods for solving L1-regularized regression problems (especially large-scale problems) include convex optimization techniques such as sequential quadratic programming or interior-point methods, e.g., (Kim, Koh, Lustig, Boyd, & Gorinevsky, 2007), coordinate descent methods (J. Friedman, Hastie, & Tibshirani, 2007), the Gauss-Seidel method (Shevade & Keerthi, 2003), generalized iterative scaling (Goodman, 2004), and iterative re-weighted lea... |

37 | Graphical models and variational methods - Ghahramani, Beal - 2001 |

33 | Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables
- Derksen, Keselman
- 1992
Citation Context ...cit meta-level knowledge of the data is known beforehand, the estimation of this quantity would require expensive cross-validation to avoid overfitting. 2.1.5 Stepwise regression Stepwise regression (Derksen & Keselman, 1992) is a popular statistical technique for large data sets that chooses dimensions to include in a regression model. The selection of dimensions for the model can be in a forward or backward manner. For... |

26 | Local dimensionality reduction
- Schaal, Vijayakumar, et al.
- 1998
Citation Context ...ression to a set of independent univariate regressions along each of the orthogonal principal component directions. A serious drawback of PCR is that it is based purely on variance in the input data (Schaal, Vijayakumar, & Atkeson, 1998). The regression solution is therefore highly sensitive to preprocessing operations such as sphering, which modify the per... |

20 | Variational Methods in Statistics
- Rustagi
- 1976
Citation Context ...he extraction of marginal probabilities of interest such as Q(b) and Q(α) is analytically intractable. Therefore, we use a factorial variational approximation (Ghahramani & Beal, 2000b; Parisi, 1988; Rustagi, 1976) to the true posterior, in which we assume that the posterior distribution factorizes (footnote 3) over the variables of interest, i.e., we restrict ourselves to a family of distributions of the form Q(Z,b,α) = ... |

16 | Changes in the temporal pattern of primary motor cortex activity in a directional isometric force versus limb movement task
- Sergio, Kalaska
- 1998
Citation Context ...deteriorated performance as soon as co-linear inputs are introduced. 6.2 Predicting EMG Activity from Neural Firing 6.2.1 Data sets We analyzed data from two different neurophysiological experiments (Sergio & Kalaska, 1998; Kakei, Hoffman, & Strick, 1999) involving monkeys trained to perform different arm movements while having their M1 neural firing rates and EMG activity recorded. The first experiment (Sergio & Kalas... |

15 | A dynamic adaptation of AD-trees for efficient machine learning on large data sets
- Komarek, Moore
- 2000
Citation Context ...edman, Bentley, & Finkel, 1977; Gray & Moore, 2001; Omohundro, 1990), which allow caching of sufficient statistics over recursively smaller regions of the data space, and AD-trees (Moore & Lee, 1998; Komarek & Moore, 2000) which speed up computations [Footnote 2: That is, solutions that minimize the L1 loss function. When the value of the open/tuning parameter changes, regularization “paths” of solutions are generated.] ... |

13 | Bayesian backfitting - Hastie, Tibshirani - 1993 |

12 | The doubly regularized support vector machine
- Wang, Zhu, et al.
- 2006
Citation Context ...tive classifiers, aside from the primal SVM (Keerthi et al., 2006), include Sparse Multinomial Logistic Regression (Krishnapuram, Carin, Figueiredo, & Hartemink, 2005) and the doubly regularized SVM (Wang, Zhu, & Zou, 2006), to list a few. Note that the original RVM does not scale well to large-scale data sets due to its $O(N^3)$ computational complexity (per EM iteration). 6.4.3 Results Table 6 shows the classification... |

10 | Principal component regression in exploratory statistical research
- Massey
- 1965
Citation Context ...ately predict the output. 2.1.1 Principal Component Regression The underlying basis of principal component regression (PCR) (Massey, 1965) is that the low-dimensional subspace which explains the most variance in x also captures the most essential information required to predict y. Starting with the empirical covariance matrix $\Sigma_{PCR}$ ... |

6 | Bayesian data analysis. London: Chapman and Hall - Gelman, Carlin, et al. - 2000 |

6 | Predicting EMG data from M1 neurons with variational Bayesian least squares - Ting, D’Souza, et al. - 2005 |

6 | Locally weighted projection regression: Incremental real time learning in high dimensional space
- Vijayakumar, Schaal
- 2000
Citation Context ...ion problem to other competitive nonlinear regression techniques such as the RVM, SVR, Gaussian Process (GP) regression (Williams & Rasmussen, 1996) and Locally Weighted Projection Regression (LWPR) (Vijayakumar & Schaal, 2000). Note that Tipping proposes an optimization of the distance metric λ, based on gradient ascent in the log likelihood (Tipping, 2001). We can also compute such a gradient for VBLS-RVM as: ∂⟨log p(y, Z... |