Results 1-10 of 66
Where mathematics meets the Internet
 Notices of the American Mathematical Society
, 1998
Cited by 97 (8 self)
The Internet has experienced a fascinating evolution in the recent past, especially since the early days of the Web, a fact well-documented not only in the trade journals, but also in the popular press. Unprecedented in its growth, unparalleled in its heterogeneity, and unpredictable or even chaotic in the behavior of its traffic, "the Internet is its own revolution", as Anthony-Michael Rutkowski, former Executive Director of the Internet Society, likes to put it.
Variable selection in data mining: Building a predictive model for bankruptcy
 Journal of the American Statistical Association
, 2004
Cited by 34 (9 self)
We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our data set of 2.9 million months of credit-card activity. We use stepwise selection to find predictors from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predictors leads to a challenging modeling question: How does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision-theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates regression predictions. With these modifications, stepwise regression predicts bankruptcy as well as, if not better than, recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1000 of the 1800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than 2/3 of the costs of classification errors produced by the tree-based classifier C4.5. Key Phrases: AIC, Cp, Bonferroni, calibration, hard thresholding, risk inflation criterion (RIC),
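The validation figures quoted in this abstract can be turned into a few summary numbers. The following is only a rough sketch in Python: it takes the rounded counts from the abstract at face value, assumes a unit cost for a false positive (so a missed bankruptcy costs 200), and treats every one of the 14,000 flagged cases that is not a bankruptcy as a false positive. The function and variable names are illustrative and do not come from the paper.

```python
# Rounded counts as quoted in the abstract (not exact values from the paper).
FLAGGED = 14_000             # largest predictions examined
HITS = 1_000                 # bankruptcies found among the flagged cases
TOTAL_BANKRUPTCIES = 1_800   # bankruptcies in the validation sample
N_VALIDATION = 2_300_000     # size of the validation sample

# Assumption: a false positive costs 1 unit, so a missed bankruptcy costs 200.
COST_FALSE_POSITIVE = 1
COST_MISS = 200


def summarize(flagged, hits, total_pos, n,
              c_fp=COST_FALSE_POSITIVE, c_miss=COST_MISS):
    """Return precision, recall, lift, and total misclassification cost."""
    precision = hits / flagged        # share of flagged cases that go bankrupt
    recall = hits / total_pos         # share of bankruptcies that are flagged
    base_rate = total_pos / n         # bankruptcy rate in the whole sample
    lift = precision / base_rate      # improvement over flagging at random
    false_positives = flagged - hits
    misses = total_pos - hits
    cost = c_fp * false_positives + c_miss * misses
    return {
        "precision": precision,
        "recall": recall,
        "lift": lift,
        "cost": cost,
    }


summary = summarize(FLAGGED, HITS, TOTAL_BANKRUPTCIES, N_VALIDATION)
# Cost of flagging nobody: every bankruptcy is missed.
cost_flag_none = COST_MISS * TOTAL_BANKRUPTCIES

for name, value in summary.items():
    print(f"{name}: {value:.4g}")
print(f"cost if nothing is flagged: {cost_flag_none}")
```

Under these assumptions, roughly 7% of the flagged cases are bankruptcies against a base rate below 0.1%, and the 200:1 cost ratio means the 800 missed bankruptcies dominate the total cost. The comparison with C4.5 cannot be reproduced from the abstract alone, since the classifier's error counts are not given there.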
Statistical Themes and Lessons for Data Mining
, 1997
Cited by 32 (3 self)
Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.
Time Series Forecasting with Neural Networks: A Case Study
, 1995
Cited by 31 (0 self)
This paper describes a case study which aims to do just that.
Understanding the sources of variation in software inspections
 ACM Transactions on Software Engineering and Methodology
, 1998
Cited by 28 (2 self)
In a previous experiment, we determined how various changes in three structural elements of the software inspection process (team size and the number and sequencing of sessions) altered effectiveness and interval. Our results showed that such changes did not significantly influence the defect detection rate, but that certain combinations of changes dramatically increased the inspection interval. We also observed a large amount of unexplained variance in the data, indicating that other factors must be affecting inspection performance. The nature and extent of these other factors now have to be determined to ensure that they had not biased our earlier results. Also, identifying these other factors might suggest additional ways to improve the efficiency of inspections. Acting on the hypothesis that the “inputs” into the inspection process (reviewers, authors, and code units) were significant sources of variation, we modeled their effects on inspection performance. We found that they were responsible for much more variation in defect detection than was process structure. This leads us to conclude that better defect detection techniques, not better process structures, are the key to improving inspection effectiveness. The combined effects of process inputs and process structure on the inspection interval accounted for only a small percentage of the variance in inspection interval. Therefore, there must be other factors which need to be identified.
Bayesian Model Averaging in proportional hazard models: Assessing the risk of a stroke
 Applied Statistics
, 1997
Cited by 28 (5 self)
Evaluating the risk of stroke is important in reducing the incidence of this devastating disease. Here, we apply Bayesian model averaging to variable selection in Cox proportional hazard models in the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for stroke. We introduce a technique based on the leaps and bounds algorithm which efficiently locates and fits the best models in the very large model space and thereby extends all subsets regression to Cox models. For each independent variable considered, the method provides the posterior probability that it belongs in the model. This is more directly interpretable than the corresponding P-values, and also more valid in that it takes account of model uncertainty. P-values from models preferred by stepwise methods tend to overstate the evidence for the predictive value of a variable. In our data, Bayesian model averaging predictively outperforms standard model selection methods for assessing
A Discussion of Parameter and Model Uncertainty in Insurance
 Insurance: Mathematics and Economics
, 2000
Cited by 22 (7 self)
In this paper we consider the process of modelling uncertainty. In particular we are concerned with making inferences about some quantity of interest which, at present, has been unobserved. Examples of such a quantity include the probability of ruin of a surplus process, the accumulation of an investment, the level of surplus or deficit in a pension fund and the future volume of new business in an insurance company. Uncertainty in this quantity of interest, y, arises from three sources: uncertainty due to the stochastic nature of a given model; uncertainty in the values of the parameters in a given model; and uncertainty in the model underlying what we are able to observe and determining the quantity of interest. It is common in actuarial science to find that the first source of uncertainty is the only one which receives rigorous attention. A limited amount of research in recent years has considered the effect of parameter uncertainty, while there is still considerable scope ...
Statistical strategies for avoiding false discoveries in metabolomics and related experiments
, 2006
Cited by 20 (5 self)
Many metabolomics, and other high-content or high-throughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case’ and ‘control’ samples. However, it is unfortunately very easy to find markers that are apparently persuasive but that are in fact entirely spurious, and there are well-known examples in the proteomics literature. The main types of danger are not entirely independent of each other, but include bias, inadequate sample size (especially relative to the number of metabolite variables and to the required statistical power to prove that a biomarker is discriminant), excessive false discovery rate due to multiple hypothesis testing, inappropriate choice of particular numerical methods, and overfitting (generally caused by the failure to perform adequate validation and cross-validation). Many studies fail to take these into account, and thereby fail to discover anything of true significance (despite their claims). We summarise these problems, and provide pointers to a substantial existing literature that should assist in the improved design and evaluation of metabolomics experiments, thereby allowing robust scientific conclusions to be drawn from the available data. We provide a list of some of the simpler checks that might improve one’s confidence that a candidate biomarker is not simply a statistical artefact, and suggest a series of preferred tests and visualisation tools that can assist readers and authors in assessing papers. These tools can be applied to individual metabolites by using multiple univariate tests performed in parallel across all metabolite peaks. They may also be applied to the validation of multivariate models. We stress in