Results 1  10
of
66
Gene selection for cancer classification using support vector machines
 Machine Learning
"... Abstract. DNA microarrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new microarray devices generate bewildering amounts of raw data, new analytical methods must ..."
Abstract

Cited by 746 (25 self)
 Add to MetaCart
Abstract. DNA microarrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new microarray devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA microarrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leaveoneout error, while 64 genes are necessary for the baseline method to get the best result (one leaveoneout error). In the colon cancer database, using only 4 genes our method is 98 % accurate, while the baseline method is only 86 % accurate.
Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV
, 1998
"... this paper we very briefly review some of these results. RKHS can be chosen tailored to the problem at hand in many ways, and we review a few of them, including radial basis function and smoothing spline ANOVA spaces. Girosi (1997), Smola and Scholkopf (1997), Scholkopf et al (1997) and others have ..."
Abstract

Cited by 162 (12 self)
 Add to MetaCart
this paper we very briefly review some of these results. RKHS can be chosen tailored to the problem at hand in many ways, and we review a few of them, including radial basis function and smoothing spline ANOVA spaces. Girosi (1997), Smola and Scholkopf (1997), Scholkopf et al (1997) and others have noted the relationship between SVM's and penalty methods as used in the statistical theory of nonparametric regression. In Section 1.2 we elaborate on this, and show how replacing the likelihood functional of the logit (log odds ratio) in penalized likelihood methods for Bernoulli [yesno] data, with certain other functionals of the logit (to be called SVM functionals) results in several of the SVM's that are of modern research interest. The SVM functionals we consider more closely resemble a "goodnessoffit" measured by classification error than a "goodnessoffit" measured by the comparative KullbackLiebler distance, which is frequently associated with likelihood functionals. This observation is not new or profound, but it is hoped that the discussion here will help to bridge the conceptual gap between classical nonparametric regression via penalized likelihood methods, and SVM's in RKHS. Furthermore, since SVM's can be expected to provide more compact representations of the desired classification boundaries than boundaries based on estimating the logit by penalized likelihood methods, they have potential as a prescreening or model selection tool in sifting through many variables or regions of attribute space to find influential quantities, even when the ultimate goal is not classification, but to understand how the logit varies as the important variables change throughout their range. This is potentially applicable to the variable/model selection problem in demographic m...
Use of the ZeroNorm With Linear Models and Kernel Methods
, 2002
"... We explore the use of the socalled zeronorm of the parameters of linear models in learning. ..."
Abstract

Cited by 122 (3 self)
 Add to MetaCart
We explore the use of the socalled zeronorm of the parameters of linear models in learning.
A Sparse Signal Reconstruction Perspective for Source Localization With Sensor Arrays
 M.S. thesis, Mass. Inst. Technol
, 2003
"... Abstract—We present a source localization method based on a sparse representation of sensor measurements with an overcomplete basis composed of samples from the array manifold. We enforce sparsity by imposing penalties based on the 1norm. A number of recent theoretical results on sparsifying proper ..."
Abstract

Cited by 121 (5 self)
 Add to MetaCart
Abstract—We present a source localization method based on a sparse representation of sensor measurements with an overcomplete basis composed of samples from the array manifold. We enforce sparsity by imposing penalties based on the 1norm. A number of recent theoretical results on sparsifying properties of 1 penalties justify this choice. Explicitly enforcing the sparsity of the representation is motivated by a desire to obtain a sharp estimate of the spatial spectrum that exhibits superresolution. We propose to use the singular value decomposition (SVD) of the data matrix to summarize multiple time or frequency samples. Our formulation leads to an optimization problem, which we solve efficiently in a secondorder cone (SOC) programming framework by an interior point implementation. We propose a grid refinement method to mitigate the effects of limiting estimates to a grid of spatial locations and introduce an automatic selection criterion for the regularization parameter involved in our approach. We demonstrate the effectiveness of the method on simulated data by plots of spatial spectra and by comparing the estimator variance to the Cramér–Rao bound (CRB). We observe that our approach has a number of advantages over other source localization techniques, including increased resolution, improved robustness to noise, limitations in data quantity, and correlation of the sources, as well as not requiring an accurate initialization. Index Terms—Directionofarrival estimation, overcomplete representation, sensor array processing, source localization, sparse representation, superresolution. I.
An affine scaling methodology for best basis selection
 IEEE Trans. Signal Processing
, 1999
"... Abstract — A methodology is developed to derive algorithms for optimal basis selection by minimizing diversity measures proposed by Wickerhauser and Donoho. These measures include the pnormlike (`(p 1)) diversity measures and the Gaussian and Shannon entropies. The algorithm development methodolog ..."
Abstract

Cited by 88 (14 self)
 Add to MetaCart
Abstract — A methodology is developed to derive algorithms for optimal basis selection by minimizing diversity measures proposed by Wickerhauser and Donoho. These measures include the pnormlike (`(p 1)) diversity measures and the Gaussian and Shannon entropies. The algorithm development methodology uses a factored representation for the gradient and involves successive relaxation of the Lagrangian necessary condition. This yields algorithms that are intimately related to the Affine Scaling Transformation (AST) based methods commonly employed by the interior point approach to nonlinear optimization. The algorithms minimizing the `(p 1) diversity measures are equivalent to a recently developed class of algorithms called FOCal Underdetermined System Solver (FOCUSS). The general nature of the methodology provides a systematic approach for deriving this class of algorithms and a natural mechanism for extending them. It also facilitates a better understanding of the convergence behavior and a strengthening of the convergence results. The Gaussian entropy minimization algorithm is shown to be equivalent to a wellbehaved p =0normlike optimization algorithm. Computer experiments demonstrate that the pnormlike and the Gaussian entropy algorithms perform well, converging to sparse solutions. The Shannon entropy algorithm produces solutions that are concentrated but are shown to not converge to a fully sparse solution. I.
Feature Selection in Unsupervised Learning via Evolutionary Search
 In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2000
"... Feature subset selection is an important problem in knowl edge discovery, not only for the insight gained from deter mining relevant modeling variables but also for the improved understandability, scalability, and possibly, accuracy of the resulting models. In this paper we consider the problem of ..."
Abstract

Cited by 63 (3 self)
 Add to MetaCart
Feature subset selection is an important problem in knowl edge discovery, not only for the insight gained from deter mining relevant modeling variables but also for the improved understandability, scalability, and possibly, accuracy of the resulting models. In this paper we consider the problem of feature selection for unsupervised learning. A number of heuristic criteria can be used to estimate the quality of clusters built from a given featuresubset. Rather than combining such criteria, we use ELSA, an evolutionary lo cal selection algorithm that maintains a diverse population of solutions that approximate the Pareto front in a multi dimensional objectiv espace. Each evolved solution repre sents a feature subset and a number of clusters; a standard Kmeans algorithm is applied to form the given n umber of clusters based on the selected features. Preliminary results on both real and synthetic data show promise in finding Paretooptimal solutions through which we can identify the significant features and the correct number of clusters.
Generalized Kernel Approach to Dissimilaritybased Classification
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2001
"... Usually, objects to be classified are represented by features. In this paper, we discuss an alternative object representation based on dissimilarity values. If such distances separate the classes well, the nearest neighbor method offers a good solution. However, dissimilarities used in practice are ..."
Abstract

Cited by 59 (2 self)
 Add to MetaCart
Usually, objects to be classified are represented by features. In this paper, we discuss an alternative object representation based on dissimilarity values. If such distances separate the classes well, the nearest neighbor method offers a good solution. However, dissimilarities used in practice are usually far from ideal and the performance of the nearest neighbor rule suffers from its sensitivity to noisy examples. We show that other, more global classification techniques are preferable to the nearest neighbor rule, in such cases. For classification purposes, two different ways of using generalized dissimilarity kernels are considered. In the first one, distances are isometrically embedded in a pseudoEuclidean space and the classification task is performed there. In the second approach, classifiers are built directly on distance kernels. Both approaches are described theoretically and then compared using experiments with different dissimilarity measures and datasets including degraded data simulating the problem of missing values.
Mathematical Programming for Data Mining: Formulations and Challenges
 INFORMS Journal on Computing
, 1998
"... This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research ch ..."
Abstract

Cited by 50 (0 self)
 Add to MetaCart
This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research challenges, and outline opportunities for contributions by the optimization research communities. Towards these goals, we include formulations of the basic categories of data mining methods as optimization problems. We also provide examples of successful mathematical programming approaches to some data mining problems. keywords: data analysis, data mining, mathematical programming methods, challenges for massive data sets, classification, clustering, prediction, optimization. To appear: INFORMS: Journal of Compting, special issue on Data Mining, A. Basu and B. Golden (guest editors). Also appears as Mathematical Programming Technical Report 9801, Computer Sciences Department, University of Wi...
Prototype selection for dissimilaritybased classifiers
 Pattern Recognition
, 2006
"... A conventional way to discriminate between objects represented by dissimilarities is the nearest neighbor method. A more efficient and sometimes a more accurate solution is offered by other dissimilaritybased classifiers. They construct a decision rule based on the entire training set, but they nee ..."
Abstract

Cited by 41 (3 self)
 Add to MetaCart
A conventional way to discriminate between objects represented by dissimilarities is the nearest neighbor method. A more efficient and sometimes a more accurate solution is offered by other dissimilaritybased classifiers. They construct a decision rule based on the entire training set, but they need just a small set of prototypes, the socalled representation set, as a reference for classifying new objects. Such alternative approaches may be especially advantageous for nonEuclidean or even nonmetric dissimilarities. The choice of a proper representation set for dissimilaritybased classifiers is not yet fully investigated. It appears that a random selection may work well. In this paper, a number of experiments has been conducted on various metric and nonmetric dissimilarity representations and prototype selection methods. Several procedures, like traditional feature selection methods (here effectively searching for prototypes), mode seeking and linear programming are compared to the random selection. In general, we find out that systematic approaches lead to better results than the random selection, especially for a small number of prototypes. Although there is no single winner as it depends on data characteristics, the kcentres works well, in general. For twoclass problems, an important observation is that our dissimilaritybased discrimination functions relying on significantly reduced prototype sets (3–10 % of the training objects) offer a similar or much better classification accuracy than the best kNN rule on the entire training set. This may be reached for multiclass data as well, however such problems are more difficult.
Classification on proximity data with lp–machines
, 1999
"... We provide a new linear program to deal with classification of data in the case of functions written in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed i ..."
Abstract

Cited by 38 (10 self)
 Add to MetaCart
We provide a new linear program to deal with classification of data in the case of functions written in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed in input space where the classification actually occurs. Moreover in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to –SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with –SV learning in proximity space and K–nearestneighbors on real world data from Neuroscience and molecular biology. 1