Results 11–20 of 5,771
Least angle regression
Annals of Statistics, 2004
"... The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to s ..."
Abstract

Cited by 1308 (43 self)
 Add to MetaCart
The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method.
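For reference (notation ours, not part of the abstract), the Lasso estimate that the modified LARS algorithm traces out solves the constrained least-squares problem

```latex
\hat{\beta}(t) \;=\; \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t ,
```

where the bound t >= 0 constrains the sum of absolute regression coefficients; the LARS modification computes the entire solution path over all values of t in a single run.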
An introduction to variable and feature selection
Journal of Machine Learning Research, 2003
"... Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. ..."
Abstract

Cited by 1283 (16 self)
 Add to MetaCart
(Show Context)
Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available.
Ideal spatial adaptation by wavelet shrinkage
Biometrika, 1994
"... With ideal spatial adaptation, an oracle furnishes information about how best to adapt a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable knot spline, or variable bandwidth kernel, to the unknown function. Estimation with the aid of an oracle o ers dramatic ad ..."
Abstract

Cited by 1251 (5 self)
 Add to MetaCart
With ideal spatial adaptation, an oracle furnishes information about how best to adapt a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable-knot spline, or variable bandwidth kernel, to the unknown function. Estimation with the aid of an oracle offers dramatic advantages over traditional linear estimation by nonadaptive kernels; however, it is a priori unclear whether such performance can be obtained by a procedure relying on the data alone. We describe a new principle for spatially adaptive estimation: selective wavelet reconstruction. We show that variable-knot spline fits and piecewise-polynomial fits, when equipped with an oracle to select the knots, are not dramatically more powerful than selective wavelet reconstruction with an oracle. We develop a practical spatially adaptive method, RiskShrink, which works by shrinkage of empirical wavelet coefficients. RiskShrink mimics the performance of an oracle for selective wavelet reconstruction as well as it is possible to do so. A new inequality in multivariate normal decision theory, which we call the oracle inequality, shows that attained performance differs from ideal performance by at most a factor 2 log n, where n is the sample size. Moreover, no estimator can give a better guarantee than this. Within the class of spatially adaptive procedures, RiskShrink is essentially optimal. Relying only on the data, it comes within a factor log^2 n of the performance of piecewise-polynomial and variable-knot spline methods equipped with an oracle. In contrast, it is unknown how or if piecewise-polynomial methods could be made to function this well when denied access to an oracle and forced to rely on data alone.
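To make the 2 log n factor concrete (our notation, following the standard statement of this result): if y_i = theta_i + sigma z_i with z_i i.i.d. standard normal, soft-thresholding at the universal threshold,

```latex
\hat{\theta}_i \;=\; \operatorname{sgn}(y_i)\,\bigl(\lvert y_i \rvert - \sigma\sqrt{2\log n}\bigr)_+ ,
```

satisfies an oracle inequality of the form

```latex
\mathbb{E}\,\lVert \hat{\theta} - \theta \rVert_2^2
\;\le\; (2\log n + 1)\Bigl(\sigma^2 + \sum_{i=1}^{n} \min(\theta_i^2, \sigma^2)\Bigr),
```

where the sum on the right is the ideal risk an oracle for selective reconstruction would attain.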
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
International Joint Conference on Artificial Intelligence, 1995
"... We review accuracy estimation methods and compare the two most common methods: crossvalidation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), te ..."
Abstract

Cited by 1248 (12 self)
 Add to MetaCart
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment (over half a million runs of C4.5 and a Naive-Bayes algorithm) to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
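A minimal sketch (ours, not the paper's original code) of the setup the abstract recommends, using scikit-learn; the dataset and classifiers are stand-ins, with a decision tree playing the role of C4.5:

```python
# Ten-fold stratified cross-validation for model selection between two
# classifiers, as the abstract recommends. All names here are scikit-learn's.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=cv)  # one accuracy per fold
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```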
An introduction to ROC analysis
Pattern Recognition Letters, 2006
"... Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are appa ..."
Abstract

Cited by 1024 (1 self)
 Add to MetaCart
(Show Context)
Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
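As a usage note (ours; the article itself is tool-agnostic), the basic ROC computation it describes, one (false positive rate, true positive rate) point per score threshold, can be sketched with scikit-learn:

```python
# Compute an ROC curve and AUC for a scoring classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Use predicted probabilities as scores; thresholding them sweeps out the curve.
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) per threshold
print("AUC:", roc_auc_score(y_te, scores))
```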
Statistical pattern recognition: A review
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
"... The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques ..."
Abstract

Cited by 998 (30 self)
 Add to MetaCart
(Show Context)
The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.
Greedy Function Approximation: A Gradient Boosting Machine
Annals of Statistics, 2000
"... Function approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest{descent minimization. A general gradient{descent \boosting" paradigm is developed for additi ..."
Abstract

Cited by 951 (12 self)
 Add to MetaCart
Function approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient-descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least-absolute-deviation, and Huber-M loss functions for regression, and multi-class logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire (1996), and Friedman, Hastie and Tibshirani (2000), are discussed.
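A minimal sketch (ours) of the least-squares special case, where the negative gradient of the loss is simply the residual, so each round fits a regression tree to the current residuals; the learning rate, depth, and round count are illustrative defaults, not the paper's settings:

```python
# Gradient boosting for squared-error regression: build an additive
# expansion F(x) = f0 + lr * sum_m h_m(x) by repeatedly fitting trees
# to the residuals (the negative gradient of the squared loss).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    f0 = float(y.mean())                       # constant initial fit
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                    # negative gradient of squared loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += learning_rate * h.predict(X)
        trees.append(h)
    return f0, trees

def predict(f0, trees, X, learning_rate=0.1):  # use the same rate as in training
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```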
Learning logical definitions from relations
Machine Learning, 1990
"... This paper describes FOIL, a system that learns Horn clauses from data expressed as relations. FOIL is based on ideas that have proved effective in attributevalue learning systems, but extends them to a firstorder formalism. This new system has been applied successfully to several tasks taken fro ..."
Abstract

Cited by 930 (8 self)
 Add to MetaCart
This paper describes FOIL, a system that learns Horn clauses from data expressed as relations. FOIL is based on ideas that have proved effective in attribute-value learning systems, but extends them to a first-order formalism. This new system has been applied successfully to several tasks taken from the machine learning literature.
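As an illustration (our example, not taken from the paper), the kind of Horn clause FOIL can induce from tuples of a parent relation is

```latex
\mathit{grandparent}(X, Z) \;\leftarrow\; \mathit{parent}(X, Y) \,\wedge\, \mathit{parent}(Y, Z) ,
```

grown literal by literal, guided by an information-based heuristic over the positive and negative tuples each candidate literal covers.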
Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging
Computational Linguistics, 1995
"... this paper, we will describe a simple rulebased approach to automated learning of linguistic knowledge. This approach has been shown for a number of tasks to capture information in a clearer and more direct fashion without a compromise in performance. We present a detailed case study of this learni ..."
Abstract

Cited by 916 (7 self)
 Add to MetaCart
(Show Context)
In this paper, we will describe a simple rule-based approach to automated learning of linguistic knowledge. This approach has been shown for a number of tasks to capture information in a clearer and more direct fashion without a compromise in performance. We present a detailed case study of this learning method applied to part-of-speech tagging.
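A minimal sketch (ours, not Brill's code) of the core idea: learned transformation rules of the form "change tag A to tag B when the previous tag is C" are applied in order to patch an initial tagging. The rule shown mirrors the classic "change NN to VB after TO" example:

```python
# Apply transformation rules to patch an initial part-of-speech tagging.
from dataclasses import dataclass

@dataclass
class Rule:
    from_tag: str   # tag to change
    to_tag: str     # replacement tag
    prev_tag: str   # triggering context: tag of the previous word

def apply_rules(tags, rules):
    for rule in rules:                  # rules fire in the order they were learned
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags

# e.g. after a naive most-frequent-tag pass, fix "to race" to be a verb:
tags = apply_rules(["PRP", "MD", "TO", "NN"], [Rule("NN", "VB", "TO")])
print(tags)  # ['PRP', 'MD', 'TO', 'VB']
```

The learner itself (not shown) greedily picks, at each iteration, the single rule that most reduces tagging error on the training corpus.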
Boosting the margin: A new explanation for the effectiveness of voting methods
Proceedings of the International Conference on Machine Learning, 1997
"... One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this ..."
Abstract

Cited by 896 (52 self)
 Add to MetaCart
One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik’s support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
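In symbols (notation ours, rendering the abstract's definition in words): for a voting classifier that assigns weight a_t >= 0 to base classifier h_t, with the weights normalized to sum to one, the margin of a labeled example (x, y) is

```latex
\operatorname{margin}(x, y)
\;=\; \sum_{t:\, h_t(x) = y} a_t \;-\; \max_{y' \neq y} \sum_{t:\, h_t(x) = y'} a_t ,
```

which lies in [-1, 1] and is positive exactly when the weighted vote classifies x correctly.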