Results 1  10
of
33
Regression Trees With Unbiased Variable Selection and Interaction Detection
 STATISTICA SINICA
, 2002
"... We propose an algorithm for regression tree construction called GUIDE. It is specifically designed to eliminate variable selection bias, a problem that can undermine the reliability of inferences from a tree structure. GUIDE controls bias by employing chisquare analysis of residuals and bootstrap c ..."
Abstract

Cited by 56 (14 self)
 Add to MetaCart
We propose an algorithm for regression tree construction called GUIDE. It is specifically designed to eliminate variable selection bias, a problem that can undermine the reliability of inferences from a tree structure. GUIDE controls bias by employing chisquare analysis of residuals and bootstrap calibration of significance probabilities. This approach allows fast computation speed, natural extension to data sets with categorical variables, and direct detection of local twovariable interactions. Previous algorithms are not unbiased and are insensitive to local interactions during split selection. The speed of GUIDE enables two further enhancements—complex modeling at the terminal nodes, such as polynomial or best simple linear models, and bagging. In an experiment with real data sets, the prediction mean square error of the piecewise constant GUIDE model is within ±20 % of that of CART�. Piecewise linear GUIDE models are more accurate; with bagging they can outperform the splinebased MARS � method.
Unbiased recursive partitioning: A conditional inference framework
 JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
, 2006
"... Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of exhaustive search procedures usually applied to fit such models have been known for a long time: overfitting and a selection bias towards covariates with many possible splits or missing values. While ..."
Abstract

Cited by 45 (5 self)
 Add to MetaCart
Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of exhaustive search procedures usually applied to fit such models have been known for a long time: overfitting and a selection bias towards covariates with many possible splits or missing values. While pruning procedures are able to solve the overfitting problem, the variable selection bias still seriously affects the interpretability of treestructured regression models. For some special cases unbiased procedures have been suggested, however lacking a common theoretical foundation. We propose a unified framework for recursive partitioning which embeds treestructured regression models into a well defined theory of conditional inference procedures. Stopping criteria based on multiple test procedures are implemented and it is shown that the predictive performance of the resulting trees is as good as the performance of established exhaustive search procedures. It turns out that the partitions and therefore the models induced by both approaches are structurally different, confirming the need for an unbiased variable selection. Moreover, it is shown that the prediction accuracy of trees with early stopping is equivalent to the prediction accuracy of pruned trees with unbiased variable selection. The methodology presented here is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates. Data from studies on glaucoma classification, node positive breast cancer survival and mammography experience are reanalyzed.
Unbiased split selection for classification trees based on the Gini Index
, 2006
"... The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides a formal ..."
Abstract

Cited by 11 (4 self)
 Add to MetaCart
The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides a formal support for variable selection bias in favor of variables with a high amount of missing values when the Gini gain is used as split selection criterion, and we suggest to use the resulting pvalue as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation and real data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the crossentropy criterion. 1
Anytime learning of decision trees
 Journal of Machine Learning Research
"... The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few nongreedy learners cannot learn good trees when the concept is diff ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few nongreedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pregiven allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available.
IMPROVING THE PRECISION OF CLASSIFICATION TREES
, 2009
"... Besides serving as prediction models, classification trees are useful for finding important predictor variables and identifying interesting subgroups in the data. These functions can be compromised by weak split selection algorithms that have variable selection biases or that fail to search beyond l ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
Besides serving as prediction models, classification trees are useful for finding important predictor variables and identifying interesting subgroups in the data. These functions can be compromised by weak split selection algorithms that have variable selection biases or that fail to search beyond local main effects at each node of the tree. The resulting models may include many irrelevant variables or select too few of the important ones. Either eventuality can lead to erroneous conclusions. Four techniques to improve the precision of the models are proposed and their effectiveness compared with that of other algorithms, including tree ensembles, on real and simulated data sets.
A note on split selection bias in classification trees
 Computational Statistics and Data Analysis
, 2004
"... A common approach to split selection in classification trees is to search through all possible splits generated by predictor variables. A splitting criterion is then used to evaluate those splits and the one with the largest criterion value is usually chosen to actually channel samples into correspo ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
A common approach to split selection in classification trees is to search through all possible splits generated by predictor variables. A splitting criterion is then used to evaluate those splits and the one with the largest criterion value is usually chosen to actually channel samples into corresponding subnodes. However, this greedy method is biased in variable selection when the numbers of the available split points for each variable are different. Such result may thus hamper the intuitively appealing nature of classification trees. The problem of the split selection bias for twoclass tasks with numerical predictors is examined. The statistical explanation of its existence is given and a solution based on the Pvalues is provided, when the Pearson chisquare statistic is used as the splitting criterion. keyword Cramér V 2 statistic; KolmogorovSmirnov statistic, Pvalue; Pearson chisquare statistic 1
Regression Tree Models for Designed Experiments
 IMS Lecture
"... Abstract: Although regression trees were originally designed for large datasets, they can profitably be used on small datasets as well, including those from replicated or unreplicated complete factorial experiments. We show that in the latter situations, regression tree models can provide simpler an ..."
Abstract

Cited by 4 (2 self)
 Add to MetaCart
Abstract: Although regression trees were originally designed for large datasets, they can profitably be used on small datasets as well, including those from replicated or unreplicated complete factorial experiments. We show that in the latter situations, regression tree models can provide simpler and more intuitive interpretations of interaction effects as differences between conditional main effects. We present simulation results to verify that the models can yield lower prediction mean squared errors than the traditional techniques. The tree models span a wide range of sophistication, from piecewise constant to piecewise simple and multiple linear, and from least squares to Poisson and logistic regression. 1.
Statistical sources of variable selection bias in classification tree algorithms based on the gini index
, 2005
"... Evidence for variable selection bias in classification tree algorithms based on the Gini Index is reviewed from the literature and embedded into a broader explanatory scheme: Variable selection bias in classification tree algorithms based on the Gini Index can be caused not only by the statistical ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
Evidence for variable selection bias in classification tree algorithms based on the Gini Index is reviewed from the literature and embedded into a broader explanatory scheme: Variable selection bias in classification tree algorithms based on the Gini Index can be caused not only by the statistical effect of multiple comparisons, but also by an increasing estimation bias and variance of the splitting criterion when plugin estimates of entropy measures like the Gini Index are employed. The relevance of these sources of variable selection bias in the different simulation study designs is examined. Variable selection bias due to the explored sources applies to all classification tree algorithms based on empirical entropy measures like the Gini Index, Deviance and Information Gain, and to both binary and multiway splitting algorithms.
Mining event histories: a social science perspective
"... Abstract We explore how recent dataminingbased tools developed in domains such as biomedicine or textmining for extracting interesting knowledge from sequence data could be applied to personal life course data. We focus on two types of approaches: ‘Survival ’ trees that attempt to partition the d ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
Abstract We explore how recent dataminingbased tools developed in domains such as biomedicine or textmining for extracting interesting knowledge from sequence data could be applied to personal life course data. We focus on two types of approaches: ‘Survival ’ trees that attempt to partition the data into homogeneous groups regarding their survival characteristics, i.e the duration until a given event occurs, and the mining of typical discriminating episodes. We show how these approaches may fruitfully complement the outcome of more classical event history analyses and single out some specific issues raised by their application to sociodemographic data.