Results 1–10 of 28
The Prediction of Faulty Classes Using Object-Oriented Design Metrics
, 1999
"... Contemporary evidence suggests that most field faults in software applications are found in a smafi percentage of the software's components. This means that if these faulty software components can be detected early in the development project's life cycle, mitigating actions can be taken, such as a ..."
Abstract

Cited by 45 (2 self)
 Add to MetaCart
Contemporary evidence suggests that most field faults in software applications are found in a small percentage of the software's components. This means that if these faulty software components can be detected early in the development project's life cycle, mitigating actions can be taken, such as a redesign. For object-oriented applications, prediction models using design metrics can be used to identify faulty classes early on. In this paper we report on a study that used object-oriented design metrics to construct such prediction models. The study used data collected from one version of a commercial Java application for constructing a prediction model. The model was then validated on a subsequent release of the same application. Our results indicate that the prediction model has a high accuracy. Furthermore, we found that an export coupling metric had the strongest association with fault-proneness, indicating a structural feature that may be symptomatic of a class with a high probability of latent faults.
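The kind of class-level prediction model this abstract describes is typically a logistic regression on design metrics. A minimal sketch, with a fabricated toy data set and a single hypothetical export-coupling metric (not the paper's data or coefficients):

```python
import math

# Toy illustration: predict fault-proneness of a class from one hypothetical
# design metric (export coupling) with logistic regression fitted by plain
# gradient descent on the negative log-likelihood. Data are fabricated.
data = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (8, 1)]  # (metric, faulty)

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit(pairs, lr=0.1, epochs=5000):
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in pairs:
            p = sigmoid(b0 + b1 * x)
            g0 += p - y          # gradient of the negative log-likelihood
            g1 += (p - y) * x
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

b0, b1 = fit(data)

def predict(x):
    """Predicted probability that a class with metric value x is faulty."""
    return sigmoid(b0 + b1 * x)

# Higher export coupling -> higher predicted fault probability
print(predict(1), predict(8))
```

In practice such a model is fitted on one release and validated on the next, as the study does.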
Bayesian Model Averaging in proportional hazard models: Assessing the risk of a stroke
 Applied Statistics
, 1997
"... Evaluating the risk of stroke is important in reducing the incidence of this devastating disease. Here, we apply Bayesian model averaging to variable selection in Cox proportional hazard models in the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for ..."
Abstract

Cited by 28 (5 self)
 Add to MetaCart
Evaluating the risk of stroke is important in reducing the incidence of this devastating disease. Here, we apply Bayesian model averaging to variable selection in Cox proportional hazard models in the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for stroke. We introduce a technique based on the leaps and bounds algorithm which efficiently locates and fits the best models in the very large model space and thereby extends all-subsets regression to Cox models. For each independent variable considered, the method provides the posterior probability that it belongs in the model. This is more directly interpretable than the corresponding P-values, and also more valid in that it takes account of model uncertainty. P-values from models preferred by stepwise methods tend to overstate the evidence for the predictive value of a variable. In our data, Bayesian model averaging predictively outperforms standard model selection methods for assessing the risk of stroke.
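The posterior inclusion probabilities the abstract mentions are usually computed by weighting candidate models and summing the weights of models that contain each variable. A sketch using the common BIC-based approximation p(M_k | data) proportional to exp(-BIC_k / 2); the candidate models and BIC values below are fabricated for illustration:

```python
import math

# Hypothetical candidate models mapped to made-up BIC values.
models = {
    ("age",): 110.0,
    ("age", "sbp"): 102.0,      # sbp = systolic blood pressure
    ("age", "smoker"): 104.0,
    ("age", "sbp", "smoker"): 103.0,
}

# Posterior model probabilities via exp(-BIC/2), rescaled for stability.
best = min(models.values())
weights = {k: math.exp(-(b - best) / 2.0) for k, b in models.items()}
total = sum(weights.values())
post = {k: w / total for k, w in weights.items()}

def inclusion_prob(var):
    """Posterior probability that `var` belongs in the model."""
    return sum(p for k, p in post.items() if var in k)

print(inclusion_prob("sbp"), inclusion_prob("smoker"))
```

Unlike a P-value from a single selected model, these probabilities average over all candidate models, which is how the approach accounts for model uncertainty.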
Contextual Advertising by Combining Relevance with Click Feedback
, 2008
"... Contextual advertising supports much of the Web’s ecosystem today. User experience and revenue (shared by the site publisher ad the ad network) depend on the relevance of the displayed ads to the page content. As with other document retrieval systems, relevance is provided by scoring the match betwe ..."
Abstract

Cited by 23 (4 self)
 Add to MetaCart
Contextual advertising supports much of the Web’s ecosystem today. User experience and revenue (shared by the site publisher and the ad network) depend on the relevance of the displayed ads to the page content. As with other document retrieval systems, relevance is provided by scoring the match between individual ads (documents) and the content of the page where the ads are shown (query). In this paper we show how this match can be improved significantly by augmenting the ad-page scoring function with extra parameters from a logistic regression model on the words in the pages and ads. A key property of the proposed model is that it can be mapped to standard cosine similarity matching and is suitable for efficient and scalable implementation over inverted indexes. The model parameter values are learnt from logs containing ad impressions and clicks, with shrinkage estimators being used to combat sparsity. To scale our computations to train on an extremely large training corpus consisting of several gigabytes of data, we parallelize our fitting algorithm in a Hadoop [10] framework. Experimental evaluation is provided showing improved click prediction over a holdout set of impression and click events from a large-scale real-world ad placement engine. Our best model achieves a 25% lift in precision relative to a traditional information retrieval model based on cosine similarity, for recalling 10% of the clicks in our test data.
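The baseline retrieval step referred to here is cosine similarity between word vectors of the ad and the page; per-word weights (standing in for the learned logistic-regression parameters the paper folds into the same cosine form) can be absorbed into the vectors. A minimal sketch with made-up weights:

```python
import math
from collections import Counter

# Hypothetical learned word weights; unlisted words get weight 1.0.
weights = {"camera": 2.0, "lens": 1.5}

def vector(text):
    """Weighted word-count vector for a piece of text."""
    counts = Counter(text.lower().split())
    return {w: c * weights.get(w, 1.0) for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

page = vector("digital camera reviews and camera lens guides")
ad_a = vector("buy a camera lens today")
ad_b = vector("cheap car insurance quotes")

print(cosine(page, ad_a) > cosine(page, ad_b))  # prints True
```

Because the weighted model is still a dot product over shared words, it can be served from the same inverted index as plain cosine matching, which is the scalability property the abstract highlights.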
Input selection and shrinkage in multiresponse linear regression
 Computational Statistics and Data Analysis
, 2007
"... The regression problem of modeling several response variables using the same set of input variables is considered. The model is linearly parameterized and the parameters are estimated by minimizing the error sum of squares subject to a sparsity constraint. The constraint has the effect of eliminatin ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
The regression problem of modeling several response variables using the same set of input variables is considered. The model is linearly parameterized and the parameters are estimated by minimizing the error sum of squares subject to a sparsity constraint. The constraint has the effect of eliminating useless inputs and constraining the parameters of the remaining inputs in the model. Two algorithms for solving the resulting convex cone programming problem are proposed. The first algorithm gives a pointwise solution, while the second one computes the entire path of solutions as a function of the constraint parameter. Based on experiments with real data sets, the proposed method has a similar performance to existing methods. In simulation experiments, the proposed method is competitive both in terms of prediction accuracy and correctness of input selection. The advantages become more apparent when many correlated inputs are available for model construction.
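The input-eliminating effect of this kind of sparsity constraint is commonly realized by a row-wise (group) shrinkage step: an input's coefficients across all responses are shrunk together, and the whole row is zeroed when its norm falls below a threshold. A sketch of that operator, with illustrative numbers not taken from the paper:

```python
import math

def group_shrink(row, t):
    """Block soft-thresholding of one input's coefficient row.

    `row` holds the input's coefficients across all responses; if the row's
    Euclidean norm is below threshold `t`, the input is eliminated from
    every response at once, otherwise the row is shrunk toward zero.
    """
    norm = math.sqrt(sum(b * b for b in row))
    if norm <= t:
        return [0.0] * len(row)
    scale = 1.0 - t / norm
    return [scale * b for b in row]

strong = group_shrink([3.0, 4.0], t=1.0)  # norm 5: shrunk but kept
weak = group_shrink([0.3, 0.4], t=1.0)    # norm 0.5: removed entirely
print(strong, weak)
```

Zeroing entire rows is what makes the selection simultaneous across responses, rather than per response as in ordinary lasso-style fitting.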
Validation of Object-Oriented Metrics
, 1999
"... Many objectoriented metrics have been proposed, and at least fourteen empirical validations of these metrics have been performed. However, recently it was noted that without controlling for the effect of class size in a validation study, the impact of a metric may be exaggerated. It thus becomes ..."
Abstract

Cited by 10 (2 self)
 Add to MetaCart
Many object-oriented metrics have been proposed, and at least fourteen empirical validations of these metrics have been performed. However, recently it was noted that without controlling for the effect of class size in a validation study, the impact of a metric may be exaggerated. It thus becomes necessary to revalidate contemporary object-oriented metrics after controlling for size. In this paper we perform a validation study on a telecommunications C++ system. We investigate 24 metrics proposed by Chidamber and Kemerer and by Briand et al. Our dependent variable was the incidence of faults due to field failures (fault-proneness). Our results indicate that out of the 24 metrics (covering coupling, cohesion, inheritance, and complexity), only four are actually related to faults after controlling for class size, and that only two of these are useful for the construction of prediction models. The two selected metrics measure coupling. The best prediction model exhibits high accuracy.
Why do we still use stepwise modelling in ecology and behaviour?
"... 1. The biases and shortcomings of stepwise multiple regression are well established within the statistical literature. However an examination of papers published in 2004 by three leading ecological and behavioural journals suggested that the use of this technique remains widespread: of 65 papers in ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
1. The biases and shortcomings of stepwise multiple regression are well established within the statistical literature. However, an examination of papers published in 2004 by three leading ecological and behavioural journals suggested that the use of this technique remains widespread: of 65 papers in which a multiple regression approach was used, 57% of studies used a stepwise procedure. 2. The principal drawbacks of stepwise multiple regression include bias in parameter estimation, inconsistencies among model selection algorithms, an inherent (but often overlooked) problem of multiple hypothesis testing, and an inappropriate focus or reliance on a single best model. We discuss each of these issues with examples. 3. We use a worked example of data on yellowhammer distribution collected over four years to highlight the pitfalls of stepwise regression. We show that stepwise regression allows models containing significant predictors to be obtained from each year’s data. In spite of the significance of the selected models, they vary substantially between years and suggest patterns that are at odds with those determined by analysing the full, four-year data set. 4. An Information Theoretic (IT) analysis of the yellowhammer data set illustrates why the varying outcomes of stepwise analyses arise. In particular, the IT approach identifies large numbers of competing models that could describe the data equally well, showing that no one model should be relied upon for inference.
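The multiple-testing problem behind point 2 is easy to simulate: when stepwise selection screens many pure-noise predictors, the single best one almost always shows a sizeable sample correlation with the response even though no real relationship exists. All data below are simulated noise; the sample sizes are arbitrary:

```python
import random

random.seed(0)
n, p = 30, 50  # 30 observations, 50 candidate predictors, all pure noise

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

y = [random.gauss(0, 1) for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]

# The first step of forward selection: pick the most correlated predictor.
best = max(abs(corr(x, y)) for x in X)
print(best)  # the "winning" noise predictor looks strongly correlated
```

A P-value computed for this winner as if it were the only variable ever tested will badly overstate the evidence, which is exactly the overstatement the abstract describes.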
A Comparison of Logistic Regression to Decision Tree Induction in the Diagnosis of Carpal Tunnel Syndrome
 Computers and Biomedical Research
, 1999
"... This paper aims to compare and contrast two types of model (logistic regression and decision tree induction) for the diagnosis of carpal tunnel syndrome using four ordered classication categories. Initially, we present the classication performance results based on more than two covariates (multivari ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
This paper aims to compare and contrast two types of model (logistic regression and decision tree induction) for the diagnosis of carpal tunnel syndrome using four ordered classification categories. Initially, we present the classification performance results based on more than two covariates (multivariate case). Our results suggest that there is no significant difference between the two methods. Further to this investigation, we present a detailed comparison of the structure of bivariate versions of the models. The first surprising result of this analysis is that the classification accuracy of the bivariate models is slightly higher than that of the multivariate ones. In addition, the bivariate models lend themselves to graphical analysis, where the corresponding decision regions can easily be represented in the two-dimensional covariate space. This analysis reveals important structural differences between the two models.
Validating Object-Oriented Design Metrics on a Commercial Java Application
, 2000
"... Many of the objectoriented metrics that have been developed by the research community are believed to measure some aspect of complexity. As such, they can serve as leading indicators of problematic classes, for example, those classes that are most faultprone. If faulty classes can be detected earl ..."
Abstract

Cited by 7 (2 self)
 Add to MetaCart
Many of the object-oriented metrics that have been developed by the research community are believed to measure some aspect of complexity. As such, they can serve as leading indicators of problematic classes, for example, those classes that are most fault-prone. If faulty classes can be detected early in the development project's life cycle, mitigating actions can be taken, such as focused inspections. Prediction models using design metrics can be used to identify faulty classes early on. In this paper, we present a cognitive theory of object-oriented metrics and an empirical study which has as objectives to formally test this theory while validating the metrics and to build a post-release fault-proneness prediction model. The cognitive mechanisms which we apply in this study to object-oriented metrics are based on contemporary models of human memory. They are: familiarity, interference, and fan effects. Our empirical study was performed with data from a commercial Java application. We found that Depth of Inheritance Tree (DIT) is a good measure of familiarity and, as predicted, has a quadratic relationship with fault-proneness. Our hypotheses were confirmed for the Import Coupling to other classes, Export Coupling, and Number of Children metrics. The Ancestor-based Import Coupling metrics were not associated with fault-proneness after controlling for the confounding effect of DIT. The prediction model constructed had good accuracy. Finally, we formulated a cost savings model and applied it to our predictive model. This demonstrated a 42% reduction in post-release costs if the prediction model is used to identify the classes that should be inspected.
Bayesian Regression with Input Noise for High Dimensional Data
"... This paper examines high dimensional regression with noisecontaminated input and output data. Goals of such learning problems include optimal prediction with noiseless query points and optimal system identification. As a first step, we focus on linear regression methods, since these can be easily c ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
This paper examines high-dimensional regression with noise-contaminated input and output data. Goals of such learning problems include optimal prediction with noiseless query points and optimal system identification. As a first step, we focus on linear regression methods, since these can be easily cast into nonlinear learning problems with locally weighted learning approaches. Standard linear regression algorithms generate biased regression estimates if input noise is present and suffer numerically when the data contains redundancy and irrelevancy. Inspired by Factor Analysis Regression, we develop a variational Bayesian algorithm that is robust to ill-conditioned data, automatically detects relevant features, and identifies input and output noise, all in a computationally efficient way. We demonstrate the effectiveness of our techniques on synthetic data and on a system identification task for a rigid body dynamics model of a robotic vision head. Our algorithm performs 10 to 70% better than previously suggested methods.
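The input-noise bias this abstract refers to is the classic errors-in-variables attenuation: the ordinary least-squares slope shrinks toward zero by roughly var(x) / (var(x) + var(noise)). A simulated demonstration of the effect the paper's algorithm corrects for (the data and parameter values are made up, not from the paper's experiments):

```python
import random

random.seed(1)
n = 20000
true_slope = 2.0
x = [random.gauss(0, 1) for _ in range(n)]                # clean inputs
y = [true_slope * xi + random.gauss(0, 0.1) for xi in x]  # output noise
x_noisy = [xi + random.gauss(0, 1) for xi in x]           # input noise, var 1

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

clean = ols_slope(x, y)        # close to the true slope of 2
noisy = ols_slope(x_noisy, y)  # attenuated toward 2 * 1/(1+1) = 1
print(clean, noisy)
```

With equal input-signal and input-noise variances the fitted slope is roughly halved, which is why methods that model the input noise explicitly, such as the one described above, can substantially outperform standard regression.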