## Online Feature Selection Using Grafting (2003)

### Download Links

- [nis-www.lanl.gov]
- [www.hpl.hp.com]
- [public.lanl.gov]
- [www.aaai.org]
- DBLP

### Other Repositories/Bibliography

Venue: International Conference on Machine Learning

Citations: 25 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Perkins03onlinefeature,
  author    = {Simon Perkins and James Theiler},
  title     = {Online Feature Selection Using Grafting},
  booktitle = {International Conference on Machine Learning},
  year      = {2003},
  pages     = {592--599},
  publisher = {ACM Press}
}
```

### Abstract

In the standard feature selection problem, we are given a fixed set of candidate features for use in a learning problem, and must select a subset that will be used to train a model that is "as good as possible" according to some criterion. In this paper, we present an interesting and useful variant, the online feature selection problem, in which, instead of all features being available from the start, features arrive one at a time. The learner's task is to select a subset of features and return a corresponding model at each time step which is as good as possible given the features seen so far. We argue that existing feature selection methods do not perform well in this scenario, and describe a promising alternative method, based on a stagewise gradient descent technique which we call grafting.
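The core step of grafting sketched in the abstract is a gradient test: when a new feature arrives, add it to an ℓ1-regularized model only if the magnitude of the loss gradient at a zero weight exceeds the regularization penalty λ. The following is an illustrative reconstruction under assumed simplifications (a linear model with the BNLL loss discussed in the citations below), not the authors' implementation; the function names are mine:

```python
import numpy as np

def bnll_gradient(X, y, w, j):
    """Gradient of the mean binomial negative log-likelihood
    L = mean(ln(1 + exp(-y * f(x)))), with f(x) = X @ w,
    taken with respect to weight w[j]."""
    margins = -y * (X @ w)
    # d/dw_j ln(1 + e^{-y f}) = -y * x_j * sigmoid(-y f)
    sig = 1.0 / (1.0 + np.exp(-margins))
    return np.mean(-y * X[:, j] * sig)

def grafting_test(X, y, w, j, lam):
    """Admit feature j into the model only if the loss-gradient
    magnitude at w[j] = 0 exceeds the l1 penalty lam; otherwise
    the regularizer would keep w[j] pinned at exactly zero."""
    assert w[j] == 0.0
    return abs(bnll_gradient(X, y, w, j)) > lam
```

On a toy dataset where feature 0 predicts the label and feature 1 is constant zero, the test admits feature 0 and rejects feature 1, which is the behavior that keeps the per-arrival update cheap in the online setting.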

### Citations

3436 | LIBSVM: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (accessed
- Chang, Lin
- 2009

Citation Context: ...Gaussian RBF kernel SVM with default libsvm kernel parameters, applied to all features in batch mode. The grafting algorithms were implemented in Matlab, while the SVM experiments made use of libsvm (Chang & Lin, 2001), written in C++. Regularization parameters — λ for the grafting experiments, C for the SVM experiments — were chosen using five-fold cross validation on each of the training sets. The non-linear mod...

2868 | UCI Repository of Machine Learning Databases
- Merz, Murphy
- 1996

Citation Context: ...ugh C. Each dataset consists of a training set and a test set. Datasets A and B are synthetic problems, while dataset C is a real world problem, taken from the online UCI Machine Learning Repository (Blake & Merz, 1998). The two synthetic problems are variations of the threshold max (TM) problem (Perkins et al., 2003). In the most basic version of this problem, the feature space contains nr informative features, ea...

2046 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001

Citation Context: ...blem. In this paper we will deal with binary classification problems, with y taking values of ±1, and so a suitable loss function is the binomial negative log-likelihood, used in logistic regression (Hastie et al., 2001, ch. 4): L_bnll = ln(1 + e^(-y f(x))). The BNLL loss function has several attractive properties. It is derived from a model that treats f(x) as the log of the ratio of the probability that y = +1 to the probabi...
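The BNLL loss quoted in this context, L_bnll = ln(1 + e^(-y f(x))), can be checked numerically against its probabilistic reading: if f(x) is the log odds of y = +1 versus y = -1, then BNLL is exactly -ln p(y|x), so minimizing it is maximum likelihood. A small sketch with function names of my own choosing:

```python
import math

def bnll(y, f):
    """Binomial negative log-likelihood: ln(1 + exp(-y * f))."""
    return math.log(1.0 + math.exp(-y * f))

def prob(y, f):
    """Probability of label y in {+1, -1} under the logistic model,
    where f is the log odds of y = +1: p(y | f) = sigmoid(y * f)."""
    return 1.0 / (1.0 + math.exp(-y * f))

# BNLL equals -ln p(y | f) at every point we check.
for y in (+1, -1):
    for f in (-2.0, 0.0, 1.5):
        assert abs(bnll(y, f) - (-math.log(prob(y, f)))) < 1e-12
```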

1631 | Experiments with a New Boosting Algorithm
- Freund, Schapire
- 1996

Citation Context: ...lds for all p < 2. The name is derived from “gradient feature testing”. Grafting is related to other stagewise modeling methods such as additive logistic regression (Friedman et al., 2000), boosting (Freund & Schapire, 1996) and matching pursuit (Mallat & Zhang, 1993). 4.1. Basic Approach Grafting is a general purpose technique that can work with a variety of models that are parameterized by a weight vector w, subject t...

1291 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992

Citation Context: ...aised to the p’th power, and so is usually called an ℓp regularizer. If p = 2, then the regularizer is equivalent to that used in ridge-regression (Hoerl & Kennard, 1970) and support vector machines (Boser et al., 1992). If p = 1, then the regularizer is the “lasso” (Tibshirani, 1994). If p → 0 then it counts the number of non-zero elements of w. The p = 1 lasso regularizer has some interesting properties. Firstly,...
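The contrast this context draws between ℓ2 (ridge) and ℓ1 (lasso) regularization can be illustrated with the standard soft-thresholding proximal update, which is what lets the lasso drive weights to exactly zero while ridge only rescales them. A minimal sketch, not from the paper; the function names are mine:

```python
import numpy as np

def prox_l1(w, t):
    """Soft-thresholding: proximal operator of t * ||w||_1.
    Weights with magnitude below t are set to exactly zero,
    which is why the l1 (lasso) penalty produces sparse models."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def shrink_l2(w, t):
    """Proximal operator of (t/2) * ||w||_2^2: a pure rescaling.
    Weights shrink toward zero but never reach it exactly."""
    return w / (1.0 + t)

w = np.array([0.05, -0.3, 2.0])
sparse = prox_l1(w, 0.1)   # smallest weight pruned to exactly 0
dense = shrink_l2(w, 0.1)  # all weights shrunk, none exactly 0
```

This zeroing behavior is what makes the ℓ1 penalty act as a feature selector, and it is the property the grafting gradient test exploits.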

1219 | Additive logistic regression: a statistical view of boosting
- Friedman, Hastie, et al.
- 2000

Citation Context: ...In fact, this second property holds for all p < 2. The name is derived from “gradient feature testing”. Grafting is related to other stagewise modeling methods such as additive logistic regression (Friedman et al., 2000), boosting (Freund & Schapire, 1996) and matching pursuit (Mallat & Zhang, 1993). 4.1. Basic Approach Grafting is a general purpose technique that can work with a variety of models that are parameter...

1081 | Practical Methods of Optimization - Fletcher - 1981

1048 | Matching pursuits with time-frequency dictionaries
- Mallat, Zhang
- 1993

Citation Context: ...adient feature testing”. Grafting is related to other stagewise modeling methods such as additive logistic regression (Friedman et al., 2000), boosting (Freund & Schapire, 1996) and matching pursuit (Mallat & Zhang, 1993). 4.1. Basic Approach Grafting is a general purpose technique that can work with a variety of models that are parameterized by a weight vector w, subject to ℓ1 regularization, and other un-regularize...

1031 | Wrappers for feature subset selection
- Kohavi, John
- 1997

Citation Context: ...arrival, and so we want to use a method whose update time does not increase without limit as more features are seen. Standard feature selection methods can be broadly divided into filter and wrapper methods (Kohavi & John, 1997). How do these approaches adapt to an online scenario? Filter methods typically use some kind of heuristic to estimate the relative importance of different features. They can be divided into two grou...

495 | Ridge regression: Biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970

Citation Context: ...e of regularizer is the familiar Minkowski ℓp norm raised to the p’th power, and so is usually called an ℓp regularizer. If p = 2, then the regularizer is equivalent to that used in ridge-regression (Hoerl & Kennard, 1970) and support vector machines (Boser et al., 1992). If p = 1, then the regularizer is the “lasso” (Tibshirani, 1994). If p → 0 then it counts the number of non-zero elements of w. The p = 1 lasso regu...

355 | A Practical Approach to Feature Selection - Kira, Rendell - 1992

145 | Correlation-based feature selection for discrete and numeric class machine learning - Hall - 2000

78 | Grafting: Fast, incremental feature selection by gradient descent in function space
- Perkins, Lacker, et al.
- 2003

Citation Context: ...lems, while dataset C is a real world problem, taken from the online UCI Machine Learning Repository (Blake & Merz, 1998). The two synthetic problems are variations of the threshold max (TM) problem (Perkins et al., 2003). In the most basic version of this problem, the feature space contains nr informative features, each of which is uniformly distributed between -1 and +1. The output label y for a data point x is def...

7 | Support Vector Machines for Broad Area Feature Extraction - Perkins, Harvey, et al. - 2001