Results 1-10 of 25
Greedy Function Approximation: A Gradient Boosting Machine
Annals of Statistics, 2000
Cited by 951 (12 self)
Function approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient-descent “boosting” paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least-absolute-deviation, and Huber-M loss functions for regression, and multi-class logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such “TreeBoost” models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Shapire 1996, and Frie...
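The stagewise idea in this abstract can be illustrated with a minimal sketch, assuming squared-error loss, a single numeric predictor, and one-split regression stumps as the additive components. The function names and the `n_rounds` and `shrinkage` parameters are illustrative choices, not taken from the paper, and the input needs at least two distinct predictor values.

```python
# Sketch of least-squares gradient boosting with regression stumps.
# All names (fit_stump, predict, boost, n_rounds, shrinkage) are assumptions.

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict(model, x):
    """Evaluate the additive expansion: initial constant plus shrunken stumps."""
    init, shrinkage, stumps = model
    total = init
    for threshold, left_mean, right_mean in stumps:
        total += shrinkage * (left_mean if x <= threshold else right_mean)
    return total


def boost(xs, ys, n_rounds=50, shrinkage=0.5):
    """Fit stumps one at a time to the current residuals."""
    init = sum(ys) / len(ys)  # best constant under squared error
    stumps = []
    model = (init, shrinkage, stumps)
    for _ in range(n_rounds):
        # For loss (y - F)^2 / 2 the negative gradient is the residual y - F.
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return model
```

Swapping in another loss would change only the line that computes the residuals (the negative gradient) and, in general, the value assigned to each leaf; this sketch does not attempt that.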
Regularized discriminant analysis
J. Amer. Statist. Assoc., 1989
Cited by 460 (2 self)
Linear and quadratic discriminant analysis are considered in the small-sample, high-dimensional setting. Alternatives to the usual maximum likelihood (plug-in) estimates for the covariance matrices are proposed. These alternatives are characterized by two parameters, the values of which are customized to individual situations by jointly minimizing a sample-based estimate of future misclassification risk. Computationally fast implementations are presented, and the efficacy of the approach is examined through simulation studies and application to data. These studies indicate that in many circumstances dramatic gains in classification accuracy can be achieved. Submitted to Journal of the American Statistical Association
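The abstract does not give the formulas for its two-parameter covariance estimates. The sketch below shows one common formulation, which should be read as an assumption: a parameter `lam` blends each class scatter matrix with the pooled scatter, and a parameter `gamma` then shrinks the result toward a multiple of the identity. The names `scatter`, `regularized_covariance`, `lam`, and `gamma` are hypothetical, and dividing by counts rather than degrees of freedom is also an assumption.

```python
# Sketch of two-parameter covariance regularization for discriminant analysis.
# The formulas and all names here are assumptions, not quoted from the paper.

def scatter(rows):
    """Return (count, scatter matrix of deviations from the mean)."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    s = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
          for j in range(p)] for i in range(p)]
    return n, s


def regularized_covariance(class_rows, all_classes, lam, gamma):
    """Blend class and pooled scatter (lam), then shrink toward identity (gamma)."""
    n_k, s_k = scatter(class_rows)
    p = len(s_k)
    stats = [scatter(rows) for rows in all_classes]
    n = sum(count for count, _ in stats)
    pooled = [[sum(s[i][j] for _, s in stats) for j in range(p)]
              for i in range(p)]
    # lam = 0 keeps the class covariance (QDA); lam = 1 gives the pooled one (LDA).
    weight = (1 - lam) * n_k + lam * n
    sigma = [[((1 - lam) * s_k[i][j] + lam * pooled[i][j]) / weight
              for j in range(p)] for i in range(p)]
    # gamma = 1 replaces sigma by its average eigenvalue times the identity.
    avg_eig = sum(sigma[i][i] for i in range(p)) / p
    return [[(1 - gamma) * sigma[i][j] + (gamma * avg_eig if i == j else 0.0)
             for j in range(p)] for i in range(p)]
```

In practice the pair (`lam`, `gamma`) would be chosen on a grid by minimizing an estimate of misclassification risk such as cross-validation, which this sketch leaves out.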
Bayes model averaging with selection of regressors
Journal of the Royal Statistical Society. Series B, Statistical Methodology, 2002
Cited by 58 (10 self)
Summary. When a number of distinct models contend for use in prediction, the choice of a single model can offer rather unstable predictions. In regression, stochastic search variable selection with Bayesian model averaging offers a cure for this robustness issue but at the expense of requiring very many predictors. Here we look at Bayes model averaging incorporating variable selection for prediction. This offers similar mean-square errors of prediction but with a vastly reduced predictor space. This can greatly aid the interpretation of the model. It also reduces the cost if measured variables have costs. The development here uses decision theory in the context of the multivariate general linear model. In passing, this reduced predictor space Bayes model averaging is contrasted with single-model approximations. A fast algorithm for updating regressions in the Markov chain Monte Carlo searches for posterior inference is developed, allowing many more variables than observations to be contemplated. We discuss the merits of absolute rather than proportionate shrinkage in regression, especially when there are more variables than observations. The methodology is illustrated on a set of spectroscopic data used for measuring the amounts of different sugars in an aqueous solution.
Importance Sampled Learning Ensembles
2003
Cited by 26 (5 self)
Learning a function of many arguments is viewed from the perspective of high-dimensional numerical quadrature. It is shown that many of the popular ensemble learning procedures can be cast in this framework. In particular, randomized methods, including bagging and random forests, are seen to correspond to random Monte Carlo integration methods, each based on particular importance sampling strategies. Nonrandom boosting methods are seen to correspond to deterministic quasi-Monte Carlo integration techniques. This view helps explain some of their properties and suggests modifications to them that can substantially improve their accuracy while dramatically improving computational performance.
Bayesian Methods for Neural Networks: Theory and Applications
1995
Cited by 14 (0 self)
this document. Before these are discussed however, perhaps we should have a tutorial on Bayesian probability theory and its application to model comparison problems. 2 Probability theory and Occam's razor
QSAR with few Compounds and Many Features
J. Chem. Inf. Comput. Sci., 2001
Cited by 11 (4 self)
Fitting quantitative structure-activity relationships (QSAR) requires different statistical methodologies and, to some degree, philosophies depending on the “shape” of the data matrix. When few features are used and there are many compounds, it is a reasonable expectation that good feature subset selection may be made and that nonlinearities and nonadditivities can be detected and diagnosed. Where there are many features and few compounds, this is unrealistic. Methods such as ridge regression (RR), PLS, and principal component regression (PCR), which abjure feature selection and rely on linearity, may provide good predictions and fair understanding. We report a development of ridge regression for the underdetermined case by using generalized cross-validation to choose the ridge constant and perform F-tests for additional information. Conventional regression diagnostics can be used in follow-up to identify nonlinearities and other departures from the model. We illustrate the approach with QSAR models of four data sets using calculated molecular descriptors.
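Choosing a ridge constant by generalized cross-validation in the underdetermined case can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the data are taken to be centered (no intercept), the GCV score is the standard n·||(I − H)y||² / tr(I − H)², and the n-by-n dual form is used so that having more features than compounds is not a problem. All function names are hypothetical, and the F-tests mentioned in the abstract are not covered.

```python
# Sketch: ridge regression with the ridge constant chosen by GCV.
# Assumes centered data and lam > 0; all names are illustrative assumptions.

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dual_inverse(x, lam):
    """Return (X X^T + lam I)^{-1}, an n-by-n matrix for n observations."""
    n = len(x)
    return invert([[sum(a * b for a, b in zip(x[i], x[j]))
                    + (lam if i == j else 0.0)
                    for j in range(n)] for i in range(n)])


def ridge_coefficients(x, y, lam):
    """Ridge solution in dual form: beta = X^T (X X^T + lam I)^{-1} y."""
    n, p = len(x), len(x[0])
    a_inv = dual_inverse(x, lam)
    alpha = [sum(a_inv[i][j] * y[j] for j in range(n)) for i in range(n)]
    return [sum(x[i][k] * alpha[i] for i in range(n)) for k in range(p)]


def gcv_score(x, y, lam):
    """GCV score n * ||(I - H) y||^2 / tr(I - H)^2."""
    n = len(x)
    a_inv = dual_inverse(x, lam)
    # With H the ridge hat matrix, I - H = lam * (X X^T + lam I)^{-1}.
    resid = [lam * sum(a_inv[i][j] * y[j] for j in range(n)) for i in range(n)]
    trace = lam * sum(a_inv[i][i] for i in range(n))
    return n * sum(r * r for r in resid) / trace ** 2


def choose_ridge_constant(x, y, grid):
    """Pick the candidate constant with the smallest GCV score."""
    return min(grid, key=lambda lam: gcv_score(x, y, lam))
```

A call such as `choose_ridge_constant(x, y, [0.01, 0.1, 1.0, 10.0])` returns one value from the grid; the grid itself is a user choice and is not specified by the source.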
A Comparison of Logistic Regression to Decision Tree Induction in the Diagnosis of Carpal Tunnel Syndrome
Computers and Biomedical Research, 1999
Cited by 9 (3 self)
This paper aims to compare and contrast two types of model (logistic regression and decision tree induction) for the diagnosis of carpal tunnel syndrome using four ordered classification categories. Initially, we present the classification performance results based on more than two covariates (multivariate case). Our results suggest that there is no significant difference between the two methods. Further to this investigation, we present a detailed comparison of the structure of bivariate versions of the models. The first surprising result of this analysis is that the classification accuracy of the bivariate models is slightly higher than that of the multivariate ones. In addition, the bivariate models lend themselves to graphical analysis, where the corresponding decision regions can easily be represented in the two-dimensional covariate space. This analysis reveals important structural differences between the two models.
A new approach to fitting linear models in high dimensional spaces
2000
Cited by 6 (0 self)
This thesis presents a new approach to fitting linear models, called “pace regression”, which also overcomes the dimensionality determination problem. Its optimality in minimizing the expected prediction loss is theoretically established, when the number of free parameters is infinitely large. In this sense, pace regression outperforms existing procedures for fitting linear models. Dimensionality determination, a special case of fitting linear models, turns out to be a natural byproduct. A range of simulation studies are conducted; the results support the theoretical analysis. Through the thesis, a deeper understanding is gained of the problem of fitting linear models. Many key issues are discussed. Existing procedures, namely OLS, AIC, BIC, RIC, CIC, CV(d), BS(m), RIDGE, NNGAROTTE and LASSO, are reviewed and compared, both theoretically and empirically, with the new methods. Estimating a mixing distribution is an indispensable part of pace regression. A measure-based minimum distance approach, including probability measures and nonnegative measures, is proposed, and strongly consistent estimators are produced. Of all minimum distance methods for estimating a mixing distribution, only the
Bayesian Prediction Using Adaptive Ridge Estimators
Cited by 4 (0 self)
The Bayesian linear model framework has become an increasingly popular building block in regression problems. It has been shown to produce models with good predictive power and can be used with basis functions that are nonlinear in the data to provide flexible estimated regression functions. Further, model uncertainty can be accounted for by Bayesian model averaging. We propose a simpler way to account for model uncertainty that is based on generalized ridge regression estimators. This is shown to predict well and to be much more computationally efficient than standard model averaging methods. Further, we demonstrate how to efficiently mix over different sets of basis functions, letting the data determine which are most appropriate for the problem at hand. Keywords: Bayesian model averaging, generalized ridge regression, prediction, regression splines, shrinkage.