BOOSTING ALGORITHMS: REGULARIZATION, PREDICTION AND MODEL FITTING
- SUBMITTED TO STATISTICAL SCIENCE
Cited by 18 (4 self)
We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.
RANDOM SURVIVAL FORESTS
, 2008
Cited by 10 (6 self)
We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A conservation-of-events principle for survival forests is introduced and used to define ensemble mortality, a simple interpretable measure of mortality that can be used as a predicted outcome. Several illustrative examples are given, including a case study of the prognostic implications of body mass for individuals with coronary artery disease. Computations for all examples were implemented using the freely available R-software package, randomSurvivalForest.
Unbiased split selection for classification trees based on the Gini Index
, 2006
Cited by 9 (3 self)
The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides formal support for variable selection bias in favor of variables with a high amount of missing values when the Gini gain is used as split selection criterion, and we suggest using the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation and real-data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.
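As an illustration of the quantity discussed in this abstract, the following is a minimal sketch of the maximally selected Gini gain for a binary response and one continuous predictor. It only computes the gain and the cutpoint that attains it; it does not reproduce the exact distribution or the p-value derived in the paper. The function names `gini_impurity` and `max_gini_gain` are hypothetical, and handling of missing values is left out.

```python
def gini_impurity(labels):
    """Gini impurity 2p(1-p) of a list of 0/1 labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def max_gini_gain(x, y):
    """Return (best_gain, best_cutpoint) over all binary splits x <= c.

    x: continuous predictor values, y: 0/1 class labels of the same length.
    best_cutpoint is None when no split improves on the parent node.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    parent = gini_impurity([label for _, label in pairs])
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        # A cutpoint is only possible between two distinct predictor values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = (parent
                - len(left) / n * gini_impurity(left)
                - len(right) / n * gini_impurity(right))
        if gain > best_gain:
            best_gain = gain
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_gain, best_cut
```

Because the gain is maximized over every admissible cutpoint, a predictor that offers more cutpoints has more chances to reach a large value by chance, which is the source of the selection bias the abstract refers to.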
Survival prediction using gene expression data: a review and comparison. submitted
, 2007
Cited by 1 (1 self)
Background: Knowledge of the transcription of the human genome might greatly enhance our understanding of cancer. In particular, gene expression may be used to predict the survival of cancer patients. A microarray measures the expression of thousands of genes simultaneously. The high dimensionality of the data poses the following problem: the number of covariates (∼10000) greatly exceeds the number of samples (∼200). Results: Here we give an inventory of methods that have been used to model survival using gene expression. These methods are critically reviewed and compared in a qualitative way. Finally, the methods are applied to artificial and real-life datasets for a quantitative comparison. Conclusions: The choice of the evaluation measure of predictive performance is crucial for the selection of the best method. Depending on the evaluation measure, either the L2-penalized Cox regression or the random forest ensemble method yields the best survival time prediction using gene expression for the data sets used. Consensus on which evaluation measure of predictive performance is best used is much needed.
Conditional variable importance for random forests
- BMC Bioinformatics (BioMed Central), Methodology article
, 2008
This is an Open Access article distributed under the terms of the Creative Commons Attribution License
Tree-Structured Classifier
- Focus Article
A tree-structured classifier is a decision tree for predicting a class variable from one or more predictor variables. THAID [15, 7] was the first such algorithm. This article focuses on the CART® [2], C4.5 [17], and GUIDE [12] methods. The algorithms are briefly reviewed and their similarities and differences compared on a real data set and by simulation. In a typical classification problem, we have a training sample L = {(X1, Y1), (X2, Y2),..., (XN, YN)} of N observations, where each X = (X1,..., XK) is a K-dimensional vector of predictor variables and Y is a class variable that takes one of J values. We want to construct a rule for predicting the Y value of a new observation given its value of X. If the predictor variables are all ordered, i.e., non-categorical, some popular classifiers are linear discriminant analysis (LDA), nearest neighbor, and support vector machines. (Categorical predictor variables can be accommodated by transformation to vectors of 0-1 dummy
Bias in Ensemble
Variable selection bias option: scale=FALSE Variable selection bias employed as a criterion for variable selection in many recent publications in biochemistry, neurology, forestry, etc., e.g. by
Variable Selection Bias in Classification Trees and Ensemble Methods
The variable selection bias evident for predictor variables with different numbers of categories in binary splitting algorithms is due to a multiple testing effect: When potential predictor variables vary in their number of categories, and thus in their number of potential cutpoints, those variables that provide more potential cutpoints are more likely to be selected by chance. This effect can be demonstrated for the rpart routine, which is an implementation of the CART algorithm in R. Ensemble methods have been introduced to increase the prediction accuracy of weak base learners such as classification trees. However, when biased classification trees are employed as base learners in ensemble methods, variable selection bias is carried forward. Simulation results are presented that show variable selection bias for the gbm routine for boosting and for the randomForests routine. Both ensemble methods provide variable importance measures for variable selection purposes that are biased when potential predictor variables vary in their number of categories: Unsurprisingly, variable importance measures that are based on the individual trees’
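The multiple testing effect described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design and does not use rpart, gbm, or random forests; it simply lets two pure-noise predictors compete on the best Gini gain over their cutpoints. The assumptions are: categories are treated as ordered levels, the response is a fair coin flip independent of both predictors, and the names `best_gain` and `selection_counts` are hypothetical.

```python
import random


def gini(pos, n):
    """Gini impurity 2p(1-p) for n observations of which pos are positive."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gain(x, y):
    """Largest Gini gain over all binary splits x <= cut."""
    n = len(y)
    total_pos = sum(y)
    parent = gini(total_pos, n)
    best = 0.0
    for cut in sorted(set(x))[:-1]:  # the largest level cannot be a cutpoint
        left_n = sum(1 for v in x if v <= cut)
        left_pos = sum(label for v, label in zip(x, y) if v <= cut)
        right_n = n - left_n
        right_pos = total_pos - left_pos
        gain = (parent
                - left_n / n * gini(left_pos, left_n)
                - right_n / n * gini(right_pos, right_n))
        best = max(best, gain)
    return best


def selection_counts(n_trials=200, n_obs=100, levels=(2, 20), seed=0):
    """Count how often each uninformative predictor wins the split selection."""
    rng = random.Random(seed)
    wins = {k: 0 for k in levels}
    for _ in range(n_trials):
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        gains = {k: best_gain([rng.randrange(k) for _ in range(n_obs)], y)
                 for k in levels}
        winner = max(levels, key=lambda k: gains[k])
        wins[winner] += 1
    return wins
```

Calling `selection_counts()` returns a dictionary keyed by the number of levels. Neither predictor carries any information about the response, so an unbiased criterion would select each about equally often; the predictor with more levels is expected to win far more often because it is searched over more cutpoints.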
A Computational Investigation on Heuristic Algorithms for 2-Edge-Connectivity Augmentation
We consider the 2-edge-connectivity augmentation problem: given a graph S = (V, E) which is not 2-edge-connected and a set of new edges E′ ⊆ V × V with non-negative weights, find a minimum cost subset X of E′ such that adding the edges of X to S results in a 2-edge-connected graph. A practical application is the extension of an existing telecommunication network to become robust against single link failures. We compare, experimentally, different algorithms for solving general and large-scale instances. This includes exact methods based on mathematical programming, simple construction heuristics and metaheuristics. As part of the design of heuristics, we consider different neighborhood structures for local search, including a very large scale neighborhood. In all cases, we exploit approaches through the graph formulation as well as through an equivalent set covering formulation. The results indicate that exact solutions by means of a basic integer programming model can be obtained in reasonably short time even on networks with 800 vertices and around 287,000 edges. Alternatively, an advanced heuristic algorithm based on subgradient optimization and iterated greedy often finds the optimal solution and is very fast. All previous benchmark instances are easily solved to optimality and new, larger, instances are introduced and studied.
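To make the problem statement in this abstract concrete, here is a minimal sketch that checks 2-edge-connectivity (connected and free of bridges) with a depth-first search and then finds a minimum cost augmentation by exhaustive enumeration. Exhaustive search is not one of the methods compared in the paper and is only usable on toy instances; the function names and the example graph are hypothetical.

```python
from itertools import combinations


def find_bridges(n, edges):
    """Return (bridges, n_components) for an undirected graph on vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    disc = [-1] * n   # discovery time of each vertex
    low = [0] * n     # lowest discovery time reachable from the subtree
    bridges = []
    timer = 0

    def dfs(u, parent_edge):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        for v, idx in adj[u]:
            if idx == parent_edge:
                continue  # skip only the edge we arrived by, not parallel edges
            if disc[v] == -1:
                dfs(v, idx)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    bridges.append(edges[idx])
            else:
                low[u] = min(low[u], disc[v])

    components = 0
    for s in range(n):
        if disc[s] == -1:
            components += 1
            dfs(s, -1)
    return bridges, components


def is_two_edge_connected(n, edges):
    bridges, components = find_bridges(n, edges)
    return components == 1 and not bridges


def min_cost_augmentation(n, edges, candidates):
    """Brute force over subsets of candidates, given as {(u, v): weight}."""
    best_cost, best_set = None, None
    items = list(candidates.items())
    for k in range(len(items) + 1):
        for subset in combinations(items, k):
            cost = sum(w for _, w in subset)
            if best_cost is not None and cost >= best_cost:
                continue
            added = [e for e, _ in subset]
            if is_two_edge_connected(n, edges + added):
                best_cost, best_set = cost, set(added)
    return best_cost, best_set
```

For example, for the path 0-1-2-3 given as `[(0, 1), (1, 2), (2, 3)]` with candidate edges `{(0, 3): 5, (0, 2): 2, (1, 3): 2}`, closing the path directly costs 5, while adding both chords costs 4 and also removes every bridge, so the enumeration returns the two chords.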
Rattle: A Data Mining GUI for R
- CONTRIBUTED RESEARCH ARTICLES
Abstract: Data mining delivers insights, patterns, and descriptive and predictive models from the large amounts of data available today in many organisations. The data miner draws heavily on methodologies, techniques and algorithms from statistics, machine learning, and computer science. R increasingly provides a powerful platform for data mining. However, scripting and programming is sometimes a challenge for data analysts moving into data mining. The Rattle package provides a graphical user interface specifically for data mining using R. It also provides a stepping stone toward using R as a programming language for data analysis.

