Algorithms for Subset Selection in Linear Regression
 STOC'08
, 2008
Abstract

Cited by 30 (3 self)
We study the problem of selecting a subset of k random variables to observe that will yield the best linear prediction of another variable of interest, given the pairwise correlations between the observation variables and the predictor variable. Under approximation-preserving reductions, this problem is also equivalent to the “sparse approximation” problem of approximating signals concisely. We propose and analyze exact and approximation algorithms for several special cases of practical interest. We give an FPTAS when the covariance matrix has constant bandwidth, and exact algorithms when the associated covariance graph, consisting of edges for pairs of variables with nonzero correlation, forms a tree or has a large (known) independent set. Furthermore, we give an exact algorithm when the variables can be embedded into a line such that the covariance decreases exponentially in the distance, and a constant-factor approximation when the variables have no “conditional suppressor variables”. Much of our reasoning is based on perturbation results for the R² multiple correlation measure, frequently used as a measure for “goodness-of-fit statistics”. It lies at the core of our FPTAS, and also allows us to extend exact algorithms to approximation algorithms when the matrix “nearly” falls into one of the above classes. We also use perturbation analysis to prove approximation guarantees for the widely used “Forward Regression” heuristic when the observation variables are nearly independent.
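As a rough illustration of the “Forward Regression” heuristic named in the abstract above, the following sketch greedily adds the variable that most increases R². It is not the paper's own code. It assumes unit-variance variables, so that R² for a subset S equals b_S^T C_S^{-1} b_S, where C holds the covariances among observation variables and b the covariances with the target. The names `solve`, `r_squared`, and `forward_regression` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular covariance submatrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def r_squared(C, b, S):
    """R^2 of the best linear predictor using subset S.

    Assumption: all variables (including the target) have unit variance.
    """
    S = list(S)
    if not S:
        return 0.0
    C_S = [[C[i][j] for j in S] for i in S]
    b_S = [b[i] for i in S]
    w = solve(C_S, b_S)  # regression weights on the subset
    return sum(wi * bi for wi, bi in zip(w, b_S))


def forward_regression(C, b, k):
    """Greedily pick up to k variables, each time maximizing R^2."""
    selected = []
    for _ in range(min(k, len(b))):
        candidates = [i for i in range(len(b)) if i not in selected]
        best = max(candidates, key=lambda i: r_squared(C, b, selected + [i]))
        selected.append(best)
    return selected, r_squared(C, b, selected)


if __name__ == "__main__":
    # Toy data (made up): three mutually uncorrelated observation variables.
    C = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    b = [0.5, 0.1, 0.7]
    print(forward_regression(C, b, 2))
```

In this toy case the observation variables are exactly independent, so R² is additive over the chosen variables and the greedy choice coincides with the best subset. The abstract's guarantee concerns the harder case where they are only nearly independent.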
Bayesian Model Averaging in proportional hazard models: Assessing the risk of a stroke
 Applied Statistics
, 1997
Abstract

Cited by 28 (5 self)
Evaluating the risk of stroke is important in reducing the incidence of this devastating disease. Here, we apply Bayesian model averaging to variable selection in Cox proportional hazard models in the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for stroke. We introduce a technique based on the leaps and bounds algorithm which efficiently locates and fits the best models in the very large model space and thereby extends all-subsets regression to Cox models. For each independent variable considered, the method provides the posterior probability that it belongs in the model. This is more directly interpretable than the corresponding P-values, and also more valid in that it takes account of model uncertainty. P-values from models preferred by stepwise methods tend to overstate the evidence for the predictive value of a variable. In our data Bayesian model averaging predictively outperforms standard model selection methods for assessing
Validating Object-Oriented Design Metrics on a Commercial Java Application
, 2000
Abstract

Cited by 7 (2 self)
Many of the object-oriented metrics that have been developed by the research community are believed to measure some aspect of complexity. As such, they can serve as leading indicators of problematic classes, for example, those classes that are most fault-prone. If faulty classes can be detected early in the development project's life cycle, mitigating actions can be taken, such as focused inspections. Prediction models using design metrics can be used to identify faulty classes early on. In this paper, we present a cognitive theory of object-oriented metrics and an empirical study whose objectives are to formally test this theory while validating the metrics and to build a post-release fault-proneness prediction model. The cognitive mechanisms which we apply in this study to object-oriented metrics are based on contemporary models of human memory. They are: familiarity, interference, and fan effects. Our empirical study was performed with data from a commercial Java application. We found that Depth of Inheritance Tree (DIT) is a good measure of familiarity and, as predicted, has a quadratic relationship with fault-proneness. Our hypotheses were confirmed for Import Coupling to other classes, Export Coupling and Number of Children metrics. The Ancestor-based Import Coupling metrics were not associated with fault-proneness after controlling for the confounding effect of DIT. The prediction model constructed had a good accuracy. Finally, we formulated a cost savings model and applied it to our predictive model. This demonstrated a 42% reduction in post-release costs if the prediction model is used to identify the classes that should be inspected.
unknown title
Abstract
Summary: The Humped-back theory of plant species richness, a theory related to Grime’s CSR ‘triangular’ model, has been widely discussed, and some evidence has been claimed in support of it. The theory suggests that species richness is maximal at intermediate levels of productivity, i.e., at intermediate positions on a stress/favourability gradient. We sought evidence for the theory from 90 stands of native podocarp/broadleaved and beech forest in the Coastal Otago region, with an adjustment made for the effect of stand area on species richness. There was no relation between adjusted species richness and an index of site stress/favourability, i.e., no support for the Humped-back theory. The theory may be inapplicable to woody vegetation, or it may be applicable only when the ‘favourable’ end of the spectrum comprises agricultural communities, or support for the theory might be inflated in the literature by a wish to find ecological generalisations.