Rough Sets.
 Int. J. of Information and Computer Sciences, 1982
"... Abstract. This article presents some general remarks on rough sets and their place in general picture of research on vagueness and uncertainty concepts of utmost interest, for many years, for philosophers, mathematicians, logicians and recently also for computer scientists and engineers particular ..."
Abstract

Cited by 793 (13 self)
Abstract. This article presents some general remarks on rough sets and their place in the general picture of research on vagueness and uncertainty, concepts of utmost interest, for many years, to philosophers, mathematicians and logicians, and recently also to computer scientists and engineers, particularly those working in such areas as AI, computational intelligence, intelligent systems, cognitive science, data mining and machine learning. Thus this article is intended to present some philosophical observations rather than to consider technical details or applications of rough set theory. Therefore we also refrain from presenting many interesting applications and some generalizations of the theory.
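The central construction of the theory can be made concrete with a small sketch (illustrative only, not taken from the article): a target set is approximated from below by the indiscernibility classes it fully contains, and from above by the classes it merely touches.

```python
# Illustrative rough-set approximations (hypothetical example, not from the article).

def approximations(classes, target):
    """classes: indiscernibility classes partitioning the universe;
    target: the set to be approximated."""
    lower = {x for c in classes if c <= target for x in c}   # certainly in target
    upper = {x for c in classes if c & target for x in c}    # possibly in target
    return lower, upper

# Universe {1..6} partitioned into three indiscernibility classes.
classes = [{1, 2}, {3, 4}, {5, 6}]
lower, upper = approximations(classes, {1, 2, 3})
# The boundary region upper - lower = {3, 4} is what makes the set "rough".
```

The boundary region being non-empty is precisely the vagueness the article discusses: {3, 4} can be neither certainly included nor certainly excluded.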
Rough sets: some extensions
 Information Sciences, 2007
"... Abstract In this article, we present some extensions of the rough set approach and we outline a challenge for the rough set based research. ..."
Abstract

Cited by 87 (6 self)
Abstract. In this article, we present some extensions of the rough set approach and we outline a challenge for rough-set-based research.
Conditional variable importance for random forests
, 2008
"... Random forests are becoming increasingly popular in many scientific fields because they can cope with“small n large p”problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene exp ..."
Abstract

Cited by 83 (3 self)
Random forests are becoming increasingly popular in many scientific fields because they can cope with “small n large p” problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance is shown to reflect the true impact of each predictor variable more reliably than the original marginal approach.
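The contrast between marginal and conditional permutation importance can be sketched in a few lines of numpy. This is a hedged illustration, not the authors' implementation: a linear least-squares model stands in for the random forest, the conditioning grid is a simple quartile binning of the correlated covariate, and all variable names and data are invented for the example.

```python
import numpy as np

# Hypothetical data: x1 drives y; x2 is highly correlated with x1 but irrelevant.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)
x2 = z + 0.1 * rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # linear stand-in for the forest

def mse(Xm):
    return np.mean((y - Xm @ beta) ** 2)

baseline = mse(X)

def marginal_importance(j):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(X[:, j])        # also destroys the x1-x2 correlation
    return mse(Xp) - baseline

def conditional_importance(j, cond):
    # Permute column j only within quartile strata of the conditioning variable,
    # preserving the correlation structure (the idea behind the conditional scheme).
    Xp = X.copy()
    edges = np.quantile(X[:, cond], [0.25, 0.5, 0.75])
    strata = np.digitize(X[:, cond], edges)
    for s in np.unique(strata):
        idx = np.where(strata == s)[0]
        Xp[idx, j] = rng.permutation(X[idx, j])
    return mse(Xp) - baseline

m0 = marginal_importance(0)
c0 = conditional_importance(0, cond=1)
# c0 < m0: the conditional scheme credits x1 only with its contribution
# beyond what the correlated x2 already explains.
```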
Statistical strategies for avoiding false discoveries in metabolomics and related experiments
, 2006
"... Many metabolomics, and other highcontent or highthroughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case ’ and ‘control ’ samples. However, it is unfortunately ve ..."
Abstract

Cited by 61 (11 self)
Many metabolomics, and other high-content or high-throughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case’ and ‘control’ samples. However, it is unfortunately very easy to find markers that are apparently persuasive but that are in fact entirely spurious, and there are well-known examples in the proteomics literature. The main types of danger are not entirely independent of each other, but include bias, inadequate sample size (especially relative to the number of metabolite variables and to the required statistical power to prove that a biomarker is discriminant), excessive false discovery rate due to multiple hypothesis testing, inappropriate choice of particular numerical methods, and overfitting (generally caused by the failure to perform adequate validation and cross-validation). Many studies fail to take these into account, and thereby fail to discover anything of true significance (despite their claims). We summarise these problems, and provide pointers to a substantial existing literature that should assist in the improved design and evaluation of metabolomics experiments, thereby allowing robust scientific conclusions to be drawn from the available data. We provide a list of some of the simpler checks that might improve one’s confidence that a candidate biomarker is not simply a statistical artefact, and suggest a series of preferred tests and visualisation tools that can assist readers and authors in assessing papers. These tools can be applied to individual metabolites by using multiple univariate tests performed in parallel across all metabolite peaks. They may also be applied to the validation of multivariate models.
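A standard remedy for the multiple-testing danger described here is the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. A minimal sketch, not code from the paper:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m       # i * alpha / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])               # largest i with p_(i) <= i*alpha/m
        reject[order[: k + 1]] = True                  # reject the k smallest p-values
    return reject

# Example: four metabolite p-values tested at FDR level 0.05.
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.50]))    # → [ True  True  True False]
```

Applied across all metabolite peaks, this is exactly the kind of parallel univariate screening with FDR control the abstract recommends over naive per-peak thresholds.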
Variable selection in data mining: Building a predictive model for bankruptcy
 Journal of the American Statistical Association
, 2004
"... We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our data set of 2.9 million months of creditcard activity. We use stepwise selection to find predictors from a mix of payment history, debt load, demographics, and ..."
Abstract

Cited by 52 (10 self)
We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our data set of 2.9 million months of credit-card activity. We use stepwise selection to find predictors from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predictors leads to a challenging modeling question: How does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision-theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates regression predictions. With these modifications, stepwise regression predicts bankruptcy as well as, if not better than, recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1,000 of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than 2/3 of the costs of classification errors produced by the tree-based classifier C4.5. Key phrases: AIC, Cp, Bonferroni, calibration, hard thresholding, risk inflation criterion (RIC).
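A toy version of the idea, forward stepwise selection with a conservative hard-threshold entry rule in the spirit of the risk inflation criterion, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the data, the sqrt(2 log p) cutoff, and the helper names are all assumptions of the example.

```python
import numpy as np

# Hypothetical sparse problem: 3 real predictors hidden among 50 candidates.
rng = np.random.default_rng(1)
n, p = 400, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 4.0
y = X @ beta_true + rng.normal(size=n)

def t_of_last(Xs, y):
    """t-statistic of the last column in an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), Xs])
    G = np.linalg.inv(A.T @ A)
    b = G @ A.T @ y
    resid = y - A @ b
    s2 = resid @ resid / (len(y) - A.shape[1])
    return b[-1] / np.sqrt(s2 * G[-1, -1])

def forward_stepwise(X, y, t_enter):
    """Admit the best remaining candidate only while its |t| clears a hard threshold."""
    selected = []
    while len(selected) < X.shape[1]:
        t, j = max((abs(t_of_last(X[:, selected + [j]], y)), j)
                   for j in range(X.shape[1]) if j not in selected)
        if t < t_enter:
            break
        selected.append(j)
    return selected

sel = forward_stepwise(X, y, t_enter=np.sqrt(2 * np.log(p)))  # RIC-style cutoff ≈ 2.8
```

The hard threshold plays the role of the conservative p-value estimates in the abstract: it keeps the 47 coincidental predictors from flooding into the model while the three real ones enter easily.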
Regression approaches for microarray data analysis
 Journal of Computational Biology
"... A variety of new procedures have been devised to handle the twosample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some de � ning characteristics of microarraybased studies: (i) the very large ..."
Abstract

Cited by 42 (1 self)
A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures, which far exceeds the number of samples (observations) available, and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice.
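The lasso mentioned above can be illustrated with a minimal coordinate-descent sketch (an assumption-laden toy, not code from the paper). On an orthonormal design its solution reduces to soft-thresholding of the least-squares estimates, which makes the sketch easy to check.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual for j
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return beta

# Sanity check on an orthonormal design: lasso = soft-thresholded least squares.
X = np.eye(5)
y = np.array([3.0, -2.0, 0.5, 1.0, -0.1])
beta = lasso_cd(X, y, lam=1.0)   # → [2, -1, 0, 0, 0]
```

The shrink-to-zero behaviour is what makes regularized procedures attractive when genes vastly outnumber samples: most coefficients are set exactly to zero rather than merely estimated unstably.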
Supplement to “Time series analysis via mechanistic models”.
 Ann. Appl. Statist., 2008
"... The purpose of time series analysis via mechanistic models is to reconcile the known or hypothesized structure of a dynamical system with observations collected over time. We develop a framework for constructing nonlinear mechanistic models and carrying out inference. Our framework permits the cons ..."
Abstract

Cited by 36 (10 self)
The purpose of time series analysis via mechanistic models is to reconcile the known or hypothesized structure of a dynamical system with observations collected over time. We develop a framework for constructing nonlinear mechanistic models and carrying out inference. Our framework permits the consideration of implicit dynamic models, meaning statistical models for stochastic dynamical systems which are specified by a simulation algorithm to generate sample paths. Inference procedures that operate on implicit models are said to have the plug-and-play property. Our work builds on recently developed plug-and-play inference methodology for partially observed Markov models. We introduce a class of implicitly specified Markov chains with stochastic transition rates, and we demonstrate its applicability to open problems in statistical inference for biological systems. As one example, these models are shown to give a fresh perspective on measles transmission dynamics. As a second example, we present a mechanistic analysis of cholera incidence data, involving interaction between two competing strains of the pathogen Vibrio cholerae.

1. Introduction. A dynamical system is a process whose state varies with time. A mechanistic approach to understanding such a system is to write down equations, based on scientific understanding of the system, which describe how it evolves with time. Further equations describe the relationship of the state of the system to available observations on it. Mechanistic time series analysis concerns drawing inferences from the available data about the hypothesized equations.
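An implicitly specified Markov chain in this sense is one defined only through a simulator of sample paths. A hedged toy example (not the authors' model): a discrete-time stochastic SIR epidemic whose transitions are binomial draws; any plug-and-play inference method could consume such a simulator as a black box.

```python
import numpy as np

def simulate_sir(S, I, R, beta, gamma, dt, steps, rng):
    """A Markov chain specified only through its simulator: the plug-and-play
    interface requires nothing beyond the ability to draw sample paths."""
    N = S + I + R
    path = [(S, I, R)]
    for _ in range(steps):
        p_inf = 1.0 - np.exp(-beta * I / N * dt)   # per-susceptible infection prob.
        p_rec = 1.0 - np.exp(-gamma * dt)          # per-infective recovery prob.
        new_inf = rng.binomial(S, p_inf)
        new_rec = rng.binomial(I, p_rec)
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        path.append((S, I, R))
    return np.array(path)

path = simulate_sir(990, 10, 0, beta=1.5, gamma=0.5, dt=0.1, steps=100,
                    rng=np.random.default_rng(42))
# Compartment counts stay non-negative and always sum to the population size.
```

Nothing here exposes transition densities, only sample paths, which is exactly the property that makes simulation-based ("plug-and-play") inference applicable.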
The design and analysis of benchmark experiments
 J Comp Graph Stat
, 2005
"... The assessment of the performance of learners by means of benchmark experiments is an established exercise. In practice, benchmark studies are a tool to compare the performance of several competing algorithms for a certain learning problem. Crossvalidation or resampling techniques are commonly used ..."
Abstract

Cited by 31 (15 self)
The assessment of the performance of learners by means of benchmark experiments is an established exercise. In practice, benchmark studies are a tool to compare the performance of several competing algorithms for a certain learning problem. Cross-validation or resampling techniques are commonly used to derive point estimates of the performances, which are compared to identify algorithms with good properties. For several benchmarking problems, test procedures taking the variability of those point estimates into account have been suggested. Most of the recently proposed inference procedures are based on special variance estimators for the cross-validated performance. We introduce a theoretical framework for inference problems in benchmark experiments and show that standard statistical test procedures can be used to test for differences in the performances. The theory is based on well-defined distributions of performance measures which can be compared with established tests. To demonstrate the usefulness in practice, the theoretical results are applied to regression and classification benchmark studies based on artificial and real-world data.
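The resampling-based comparison described here can be sketched with a toy benchmark: two competing predictors are evaluated on repeated random train/test splits, and the resulting per-split loss distributions are compared with a standard paired test. All names and data in the sketch are invented for illustration.

```python
import numpy as np

# Hypothetical benchmark: a simple linear fit vs. the trivial mean predictor.
rng = np.random.default_rng(7)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def loss_linear(xtr, ytr, xte, yte):
    slope, intercept = np.polyfit(xtr, ytr, 1)
    return np.mean((yte - (intercept + slope * xte)) ** 2)

def loss_mean(xtr, ytr, xte, yte):
    return np.mean((yte - ytr.mean()) ** 2)

B = 100                                  # number of resampling replications
diffs = np.empty(B)
for b in range(B):
    idx = rng.permutation(n)
    tr, te = idx[:200], idx[200:]
    diffs[b] = (loss_mean(x[tr], y[tr], x[te], y[te])
                - loss_linear(x[tr], y[tr], x[te], y[te]))

# Paired comparison of the per-split performance distributions: a large
# positive t-statistic indicates the linear model genuinely wins.
t_stat = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(B))
```

Because each split yields one paired loss difference, standard one-sample test machinery applies directly to `diffs`, which is the point the abstract makes about using established tests on well-defined performance distributions.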
Hierarchical Testing of Variable Importance
"... A frequently encountered challenge in highdimensional regression is the detection of relevant variables. Variable selection suffers from instability and the power to detect relevant variables is typically low if predictor variables are highly correlated. When taking the multiplicity of the testing ..."
Abstract

Cited by 22 (3 self)
A frequently encountered challenge in high-dimensional regression is the detection of relevant variables. Variable selection suffers from instability, and the power to detect relevant variables is typically low if predictor variables are highly correlated. When taking the multiplicity of the testing problem into account, the power diminishes even further. To gain power and insight, it can be advantageous to look for influence not at the level of individual variables but rather at the level of clusters of highly correlated variables. We propose a hierarchical approach. Variable importance is first tested at the coarsest level, corresponding to the global null hypothesis. If possible, the method then tries to attribute any effect to smaller subclusters or even individual variables. The smallest possible clusters which still exhibit a significant influence on the response variable are retained. It is shown that the proposed testing procedure controls the familywise error rate at a prespecified level, simultaneously over all resolution levels. The method has comparable power to Bonferroni-Holm on the level of individual variables and dramatically larger power for coarser resolution levels. The best resolution level is selected adaptively.
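A schematic of the hierarchical idea (an illustrative reconstruction, not the authors' code): each cluster in a binary hierarchy is tested at a level scaled by its size, and the procedure descends into subclusters only when the parent is significant. Stouffer's z-combination is used here as a hypothetical stand-in for the cluster test.

```python
import math
import numpy as np

def stouffer_p(z_scores):
    """Combine per-variable z-scores with Stouffer's method; two-sided p-value."""
    z = np.sum(z_scores) / math.sqrt(len(z_scores))
    return math.erfc(abs(z) / math.sqrt(2))      # = 2 * (1 - Phi(|z|))

def hierarchical_test(z, cluster, m, alpha=0.05):
    """Test a cluster at level alpha * |C| / m; recurse into halves only if
    significant. Returns the smallest significant clusters (tuples of indices)."""
    p = stouffer_p(z[list(cluster)])
    if p > alpha * len(cluster) / m:
        return []
    if len(cluster) == 1:
        return [cluster]
    mid = len(cluster) // 2
    left = hierarchical_test(z, cluster[:mid], m, alpha)
    right = hierarchical_test(z, cluster[mid:], m, alpha)
    # If no subcluster is individually significant, retain the parent itself.
    return (left + right) if (left or right) else [cluster]

# z-scores: strong signal on variables 0 and 1, noise elsewhere.
z = np.array([6.0, 5.5, 0.2, -0.3, 0.1, -0.2, 0.4, 0.0])
found = hierarchical_test(z, tuple(range(8)), m=8)
# → [(0,), (1,)] : the effect is attributed all the way down to single variables.
```

When the signal is spread thinly over a correlated cluster instead, the recursion stops early and the cluster itself is reported, which is the power gain at coarser resolution levels described in the abstract.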