Results 1 - 10
of
49
Where mathematics meets the Internet
- Notices of the American Mathematical Society
, 1998
"... The Internet has experienced a fascinating evolution in the recent past, especially since the early days of the Web, a fact well-documented not only in the trade journals, but also in the popular press. Unprecedented in its growth, unparalleled in its heterogeneity, and unpredictable or even chaotic ..."
Abstract
-
Cited by 88 (8 self)
- Add to MetaCart
The Internet has experienced a fascinating evolution in the recent past, especially since the early days of the Web, a fact well-documented not only in the trade journals, but also in the popular press. Unprecedented in its growth, unparalleled in its heterogeneity, and unpredictable or even chaotic in the behavior of its tra c, \the Internet is its own revolution", as Anthony-Michael Rutkowski, former Executive Director of the Internet Society, likes to put it.
Statistical Themes and Lessons for Data Mining
, 1997
"... Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statist ..."
Abstract
-
Cited by 30 (3 self)
- Add to MetaCart
Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.
Understanding the sources of variation in software inspections
- ACM Transactions on Software Engineering and Methodology
, 1998
"... In a previous experiment, we determined how various changes in three structural elements of the software inspection process (team size and the number and sequencing of sessions) altered effectiveness and interval. Our results showed that such changes did not significantly influence the defect detect ..."
Abstract
-
Cited by 28 (2 self)
- Add to MetaCart
In a previous experiment, we determined how various changes in three structural elements of the software inspection process (team size and the number and sequencing of sessions) altered effectiveness and interval. Our results showed that such changes did not significantly influence the defect detection rate, but that certain combinations of changes dramatically increased the inspection interval. We also observed a large amount of unexplained variance in the data, indicating that other factors must be affecting inspection performance. The nature and extent of these other factors now have to be determined to ensure that they had not biased our earlier results. Also, identifying these other factors might suggest additional ways to improve the efficiency of inspections. Acting on the hypothesis that the “inputs ” into the inspection process (reviewers, authors, and code units) were significant sources of variation, we modeled their effects on inspection performance. We found that they were responsible for much more variation in defect detection than was process structure. This leads us to conclude that better defect detection techniques, not better process structures, are the key to improving inspection effectiveness. The combined effects of process inputs and process structure on the inspection interval accounted for only a small percentage of the variance in inspection interval. Therefore, there must be other factors which need to be identified.
Variable selection in data mining: Building a predictive model for bankruptcy
- Journal of the American Statistical Association
, 2004
"... We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our data set of 2.9 million months of credit-card activity. We use stepwise selection to find predictors from a mix of payment history, debt load, demographics, and ..."
Abstract
-
Cited by 24 (7 self)
- Add to MetaCart
We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our data set of 2.9 million months of credit-card activity. We use stepwise selection to find predictors from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predictors leads to a challenging modeling question: How does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates regression predictions. With these modifications, stepwise regression predicts bankruptcy as well, if not better, than recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1000 of the 1800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than 2/3 of the costs of classification errors produced by the tree-based classifier C4.5. Key Phrases: AIC, Cp, Bonferroni, calibration, hard thresholding, risk inflation criterion (RIC),
Bayesian Model Averaging in proportional hazard models: Assessing the risk of a stroke
- Applied Statistics
, 1997
"... Evaluating the risk of stroke is important in reducing the incidence of this devastating disease. Here, we apply Bayesian model averaging to variable selection in Cox proportional hazard models in the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for ..."
Abstract
-
Cited by 20 (5 self)
- Add to MetaCart
Evaluating the risk of stroke is important in reducing the incidence of this devastating disease. Here, we apply Bayesian model averaging to variable selection in Cox proportional hazard models in the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for stroke. We introduce a technique based on the leaps and bounds algorithm which e ciently locates and ts the best models in the very large model space and thereby extends all subsets regression to Cox models. For each independent variable considered, the method provides the posterior probability that it belongs in the model. This is more directly interpretable than the corresponding P-values, and also more valid in that it takes account of model uncertainty. P-values from models preferred by stepwise methods tend to overstate the evidence for the predictive value of a variable. In our data Bayesian model averaging predictively outperforms standard model selection methods for assessing
Variable selection and Bayesian model averaging in case-control studies
, 1998
"... Covariate and confounder selection in case-control studies is most commonly carried out using either a two-step method or a stepwise variable selection method in logistic regression. Inference is then carried out conditionally on the selected model, but this ignores the model uncertainty implicit in ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
Covariate and confounder selection in case-control studies is most commonly carried out using either a two-step method or a stepwise variable selection method in logistic regression. Inference is then carried out conditionally on the selected model, but this ignores the model uncertainty implicit in the variable selection process, and so underestimates uncertainty about relative risks. We report on a simulation study designed to be similar to actual case-control studies. This shows that p-values computed after variable selection can greatly overstate the strength of conclusions. For example, for our simulated case-control studies with 1,000 subjects, of variables declared to be "significant" with p-values between.01 and.05, only 49 % actually were risk factors when stepwise variable selection was used. We propose Bayesian model averaging as a formal way of taking account of model uncertainty in case-control studies. This yields an easily interpreted summary, the posterior probability that a variable is a risk factor, and our simulation study indicates this to be reasonably well calibrated in the situations simulated. The methods are applied and compared
Time Series Forecasting with Neural Networks: A Case Study
, 1995
"... This paper describes a case study which aims to do just that. ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
This paper describes a case study which aims to do just that.
A Process-Oriented Heuristic for Model Selection
, 1998
"... Current methods to avoid overfitting are either data-oriented (using separate data for validation) or representation-oriented (penalizing complexity in the model). This paper proposes process-oriented evaluation, where a model's expected generalization error is computed as a function of the search p ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
Current methods to avoid overfitting are either data-oriented (using separate data for validation) or representation-oriented (penalizing complexity in the model). This paper proposes process-oriented evaluation, where a model's expected generalization error is computed as a function of the search process that led to it. The paper develops the necessary theoretical framework, and applies it to one type of learning: rule induction. A process-oriented version of the CN2 rule learner is empirically compared with the default CN2. The process-oriented version is more accurate in a large majority of the datasets, with high significance, and also produces simpler models. Experiments in artificial domains suggest that processoriented evaluation is particularly useful in high-dimensional domains. 1 INTRODUCTION Overfitting avoidance is often considered the central problem of machine learning (e.g., (Cheeseman & Oldford, 1994)). If a learner is sufficiently powerful, it must guard against selec...

