Results 1–10 of 41
A review of feature selection techniques in bioinformatics
Y. Saeys. Bioinformatics (Proceedings of LBM’07), 2007
Cited by 136 (6 self)
Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this paper, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques and discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications. Companion website:
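As a concrete instance of the "filter" category in such a taxonomy, the sketch below ranks genes by a simple univariate separation score on hypothetical data; a real pipeline would use an established criterion such as a t-statistic or mutual information.

```python
def filter_select(X, y, k):
    """Univariate filter: score each feature independently of any classifier
    and keep the top k. X is a list of samples (lists of feature values),
    y holds binary class labels (0/1)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        g0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        g1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
        # pooled spread; the small constant avoids division by zero
        s = (sum((v - m0) ** 2 for v in g0) + sum((v - m1) ** 2 for v in g1)) ** 0.5
        scores.append((abs(m0 - m1) / (s + 1e-9), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Feature 0 separates the two classes; feature 1 carries no signal.
X = [[5.0, 0.1], [5.1, 0.2], [1.0, 0.1], [1.1, 0.2]]
y = [0, 0, 1, 1]
print(filter_select(X, y, 1))  # [0]
```

Wrapper and embedded methods, the other branches of the usual taxonomy, would instead involve the classifier itself in evaluating candidate gene subsets.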
A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression
Bioinformatics, 2004
Cited by 108 (4 self)
This paper studies the problem of building multiclass classifiers for tissue classification based on gene expression. The recent development of microarray technologies has enabled biologists to quantify the expression of tens of thousands of genes in a single experiment. Biologists have begun collecting gene expression data for large numbers of samples. One of the urgent issues in the use of microarray data is developing methods for characterizing samples based on their gene expression. The most basic step in this research direction is binary sample classification, which has been studied extensively over the past few years. This paper investigates the next step: multiclass classification of samples based on gene expression. The characteristics of expression data (e.g., large number of genes with small sample size)
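One standard route from binary to multiclass classification is a per-class decomposition; in the sketch below (hypothetical toy data), the trained binary classifier per class is stood in for by a nearest-centroid score.

```python
def train_ovr(X, y):
    """One-vs-rest flavor: one 'detector' per class. Here each detector is
    simply the class centroid; a real system would train a binary classifier
    for each class against the rest."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, lab in zip(X, y) if lab == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_ovr(centroids, x):
    """Each per-class scorer votes; pick the class with the best (closest) score."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda c: dist(centroids[c], x))

X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = ["A", "A", "B", "B", "C", "C"]
model = train_ovr(X, y)
print(predict_ovr(model, [5, 5.5]))  # B
```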
Optimal Solutions for Sparse Principal Component Analysis
Cited by 45 (11 self)
Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of nonzero coefficients, with total complexity O(n^3), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n^3) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases.
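The greedy part of the procedure can be sketched as below on a toy covariance matrix: at each step, add the variable whose inclusion maximizes the largest eigenvalue of the covariance submatrix, producing one solution per cardinality. The semidefinite relaxation and the optimality certificates described in the abstract are omitted, so this is only the forward-selection idea.

```python
import numpy as np

def greedy_sparse_pca(Sigma, k_max):
    """Greedy forward selection for sparse PCA (sketch).

    Returns, for each cardinality 1..k_max, the chosen support and the
    variance it explains (largest eigenvalue of the covariance submatrix).
    """
    n = Sigma.shape[0]
    support = []
    solutions = []
    for _ in range(k_max):
        best_var, best_val = None, -np.inf
        for j in range(n):
            if j in support:
                continue
            idx = support + [j]
            sub = Sigma[np.ix_(idx, idx)]
            val = np.linalg.eigvalsh(sub)[-1]  # largest eigenvalue
            if val > best_val:
                best_var, best_val = j, val
        support.append(best_var)
        solutions.append((sorted(support), best_val))
    return solutions

# Toy covariance: variables 0 and 1 are strongly correlated, 2 is weak noise.
Sigma = np.array([[2.0, 1.8, 0.0],
                  [1.8, 2.0, 0.0],
                  [0.0, 0.0, 0.5]])
for supp, var in greedy_sparse_pca(Sigma, 2):
    print(supp, round(var, 2))  # [0] 2.0  then  [0, 1] 3.8
```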
Full regularization path for sparse principal component analysis
El Ghaoui et al. In Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML), 2007
Cited by 13 (5 self)
Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a particular linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all numbers of nonzero coefficients, with complexity O(n^3), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n^3). We show on toy examples and biological data that our algorithm does provide globally optimal solutions in many cases.
Gene expression module discovery using Gibbs sampling
Genome Informatics, 2004
Cited by 10 (1 self)
Recent advances in high-throughput profiling of gene expression have catalyzed an explosive growth in functional genomics aimed at the elucidation of genes that are differentially expressed in various tissue or cell types across a range of experimental conditions. These studies can lead to the identification of diagnostic genes, classification of genes into functional categories, association of genes with regulatory pathways, and clustering of genes into modules that are potentially co-regulated by a group of transcription factors. Traditional clustering methods such as hierarchical clustering or principal component analysis are difficult to deploy effectively for several of these tasks, since genes rarely exhibit similar expression patterns across a wide range of conditions. Biclustering of gene expression data is a promising methodology for identifying gene groups that show a coherent expression profile across a subset of conditions. This methodology can be a first step towards the discovery of co-regulated and co-expressed genes or modules. Although biclustering (also called block clustering) was introduced in statistics in 1974, few robust and efficient solutions exist for extracting gene expression modules from microarray data. In this paper, we propose a simple but promising new approach to biclustering based on a Gibbs sampling paradigm. Our algorithm is implemented in the program GEMS (Gene Expression Module Sampler). GEMS has been tested on synthetic data generated to evaluate the effect of noise on the performance of the algorithm, as well as on published leukemia datasets. In preliminary studies comparing GEMS with other biclustering software, we show that GEMS is a reliable, flexible and computationally efficient approach for biclustering gene expression data.
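To give a feel for the alternating conditional updates behind such a sampler, the sketch below is a drastically simplified, deterministic caricature (the zero-temperature limit of a Gibbs update, i.e. iterated conditional modes), not the GEMS model itself: row and column memberships are updated in turn from the current bicluster instead of being sampled.

```python
def icm_bicluster(M, n_sweeps=10):
    """Find a high-valued bicluster in a 0/1-ish matrix: keep a row (column)
    iff its mean over the current bicluster columns (rows) is at least 0.5.
    A deterministic stand-in for the per-row/per-column Gibbs updates used by
    samplers such as GEMS."""
    n_rows, n_cols = len(M), len(M[0])
    rows = set(range(n_rows))   # start with everything included
    cols = set(range(n_cols))
    for _ in range(n_sweeps):
        if cols:
            rows = {i for i in range(n_rows)
                    if sum(M[i][j] for j in cols) / len(cols) >= 0.5}
        if rows:
            cols = {j for j in range(n_cols)
                    if sum(M[i][j] for i in rows) / len(rows) >= 0.5}
    return sorted(rows), sorted(cols)

# Matrix with a high-valued bicluster in rows 0-1, columns 0-1.
M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
print(icm_bicluster(M))  # ([0, 1], [0, 1])
```

A true Gibbs sampler would draw each membership from its conditional probability instead of thresholding, which is what lets it escape local optima and quantify uncertainty.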
A Two-Stage Gene Selection Algorithm by Combining ReliefF and mRMR
Cited by 8 (1 self)
Abstract—Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminate biological samples of different types. In this paper, we present a two-stage selection algorithm that combines ReliefF and mRMR: in the first stage, ReliefF is applied to find a candidate gene set; in the second stage, mRMR is applied to directly and explicitly reduce redundancy, selecting a compact yet effective gene subset from the candidate set. We also perform comprehensive experiments to compare the mRMR-ReliefF selection algorithm with ReliefF, mRMR and other feature selection methods, using two classifiers (SVM and naive Bayes) on seven different datasets. The experimental results show that the mRMR-ReliefF gene selection algorithm is very effective. Index Terms—Gene selection algorithms, ReliefF, mRMR, mRMR-ReliefF.
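A compact sketch of the two stages on hypothetical data (simplified: a single nearest hit/miss for the Relief stage, absolute Pearson correlation as the mRMR redundancy term; the published algorithms are more elaborate):

```python
def relief_scores(X, y):
    """Stage 1 (Relief-style relevance): a feature scores well when it differs
    across each sample's nearest miss (other class) more than across its
    nearest hit (same class)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    n, d = len(X), len(X[0])
    scores = [0.0] * d
    for i in range(n):
        hit = min((k for k in range(n) if k != i and y[k] == y[i]),
                  key=lambda k: dist(X[i], X[k]))
        miss = min((k for k in range(n) if y[k] != y[i]),
                   key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            scores[j] += (abs(X[i][j] - X[miss][j]) - abs(X[i][j] - X[hit][j])) / n
    return scores

def mrmr_select(X, relevance, candidates, k):
    """Stage 2 (mRMR-style): greedily add the candidate maximizing relevance
    minus mean redundancy (|correlation| with already-selected features)."""
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
        sa = sum((u - ma) ** 2 for u in a) ** 0.5
        sb = sum((v - mb) ** 2 for v in b) ** 0.5
        return cov / (sa * sb + 1e-12)
    cols = {j: [row[j] for row in X] for j in candidates}
    selected = [max(candidates, key=lambda j: relevance[j])]
    while len(selected) < k:
        def gain(j):
            red = sum(abs(corr(cols[j], cols[s])) for s in selected)
            return relevance[j] - red / len(selected)
        selected.append(max((j for j in candidates if j not in selected), key=gain))
    return selected

# Features 0 and 1 are exact duplicates; 2 is informative; 3 is noise.
X = [[1.0, 1.0, 0.95, 0.5],
     [0.9, 0.9, 0.95, 0.4],
     [0.0, 0.0, 0.05, 0.5],
     [0.1, 0.1, 0.05, 0.6]]
y = [0, 0, 1, 1]
rel = relief_scores(X, y)
top = sorted(range(4), key=lambda j: rel[j], reverse=True)[:3]  # Relief filter
print(sorted(mrmr_select(X, rel, top, 2)))  # [0, 2] -- the duplicate is skipped
```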
An evolutionary clustering algorithm for gene expression microarray data analysis
IEEE Transactions on Evolutionary Computation, 2006
Cited by 7 (0 self)
Abstract—Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains. In particular, some of them have been used in bioinformatics research to uncover inherent clusters in gene expression microarray data. In this paper, we show how some popular clustering algorithms have been used for this purpose. Based on experiments using simulated and real data, we also show that the performance of these algorithms can be further improved. For more effective clustering of gene expression microarray data, which is typically characterized by a lot of noise, we propose a novel evolutionary algorithm called evolutionary clustering (EvoCluster). EvoCluster encodes an entire cluster grouping in a chromosome, so that each gene in the chromosome encodes one cluster. Based on such an encoding scheme, it makes use of a set of reproduction operators to facilitate the exchange of grouping information between chromosomes. The fitness function that EvoCluster adopts is able to differentiate how relevant a feature value is in determining a particular cluster grouping. As such, instead of just local pairwise distances, it also takes into consideration how clusters are arranged globally. Unlike many popular clustering algorithms, EvoCluster does not require the number of clusters to be decided in advance. Also, patterns hidden in each cluster can be explicitly revealed and presented for easy interpretation, even by casual users. For performance evaluation, we have tested EvoCluster using both simulated and real data. Experimental results show that it can be very effective and robust even in the presence of noise and missing values. Also, when correlating the gene expression microarray data with DNA sequences, we were able to uncover significant biological binding sites (both previously known and unknown) in each cluster discovered by EvoCluster.
Index Terms—Bioinformatics, clustering, DNA sequence analysis, evolutionary algorithms (EAs), gene expression microarray data analysis.
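The encoding idea (a whole grouping per chromosome, one cluster per "gene" of the chromosome) can be sketched with a hypothetical grouping-exchange operator; the actual EvoCluster reproduction operators and fitness function are richer than this.

```python
import random

def make_chromosome(labels):
    """EvoCluster-style encoding: one chromosome holds an entire grouping,
    and each 'gene' of the chromosome is one cluster (a set of record ids)."""
    clusters = {}
    for rec, lab in enumerate(labels):
        clusters.setdefault(lab, set()).add(rec)
    return list(clusters.values())

def exchange_cluster(parent_a, parent_b, rng):
    """Hypothetical grouping-exchange operator: transplant one whole cluster
    from parent B into a copy of parent A, removing the transplanted records
    from A's other clusters so the result is still a valid partition."""
    donated = rng.choice(parent_b)
    child = [c - donated for c in parent_a]   # remove donated records
    child = [c for c in child if c]           # drop emptied clusters
    child.append(set(donated))
    return child

rng = random.Random(1)
a = make_chromosome([0, 0, 1, 1, 2, 2])   # grouping {0,1} {2,3} {4,5}
b = make_chromosome([0, 1, 1, 1, 0, 0])   # grouping {0,4,5} {1,2,3}
child = exchange_cluster(a, b, rng)
print(sorted(sorted(c) for c in child))   # still a partition of records 0..5
```

Because each gene is a whole cluster, the operator exchanges grouping information directly, and the number of clusters can grow or shrink without being fixed in advance.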
Feature Selection for Gene Expression using Model-based Entropy
Cited by 6 (1 self)
Abstract—Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminate biological samples of different types. With machine learning techniques, traditional gene selection based on empirical mutual information suffers from the data-sparseness issue due to the small number of samples. To overcome the sparseness issue, we propose a model-based approach that estimates the entropy of the class variables on the model, instead of on the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and standard deviation, and are widely used to approximate various distributions. Given that the data follow a multivariate normal distribution, the conditional distribution of the class variables given the selected features is also normal, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is not efficient. We propose several algorithms to greatly reduce the computational cost. Experiments on seven gene datasets, and comparison with five other approaches, show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
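The log-determinant computation at the core of this approach is straightforward; a minimal sketch (the feature-selection search and the paper's fast update algorithms are omitted):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate normal with covariance `cov`:
    H = 0.5 * log((2*pi*e)^k * det(cov)), computed via the log-determinant
    (slogdet is numerically safer than log(det(...)))."""
    k = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)

# Sanity check: for one variable this reduces to 0.5*log(2*pi*e*sigma^2).
sigma2 = 2.0
h1 = gaussian_entropy(np.array([[sigma2]]))
print(np.isclose(h1, 0.5 * np.log(2 * np.pi * np.e * sigma2)))  # True

# A correlated pair has lower joint entropy than an independent pair with the
# same marginals -- the effect an entropy-based selection criterion exploits.
indep = np.eye(2)
corr = np.array([[1.0, 0.8], [0.8, 1.0]])
print(gaussian_entropy(corr) < gaussian_entropy(indep))  # True
```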
A novel approach to phylogenetic tree construction using stochastic optimization and clustering
BioMed Central, 2006
Cited by 6 (3 self)
Robust and accurate cancer classification with gene expression profiling
In Proc. 4th IEEE Comput. Syst. Bioinf. Conf., 2005
Cited by 4 (0 self)
Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small-sample-size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low-dimensional space and thus meets the recommended samples-to-features-per-class ratio. As a result, it can be used to classify new samples robustly, with low and trustworthy (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, conventional LDA requires that the within-class scatter matrix Sw be nonsingular. Unfortunately, Sw is always singular in the case of cancer classification due to the small-sample-size problem. To overcome this, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimizing Fisher's criterion. GLDA is mathematically well-founded and coincides with conventional LDA when Sw is nonsingular. Unlike conventional LDA, GLDA does not assume nonsingularity of Sw, and thus naturally solves the small-sample-size problem. To accommodate the high dimensionality of the scatter matrices, a fast algorithm for GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well; on some difficult instances with very small samples-to-genes-per-class ratios, it achieves much higher accuracies than widely used classification methods such as support vector machines and random forests.
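A common workaround in the same spirit (though not the paper's GLDA itself) replaces the inverse of the singular within-class scatter matrix with its Moore-Penrose pseudoinverse; a sketch on toy data where Sw is necessarily singular:

```python
import numpy as np

def lda_direction(X, y):
    """Two-class discriminant direction w ~ Sw^+ (m1 - m0). The pseudoinverse
    lets a singular within-class scatter matrix Sw (the small-sample-size
    case) be handled; a common workaround in the same spirit as, but not
    identical to, the GLDA described above."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.pinv(Sw) @ (m1 - m0)   # pinv copes with singular Sw
    return w / np.linalg.norm(w)

# Three "genes", two samples per class; the constant third gene makes Sw
# singular, so plain inv(Sw) would fail here.
X = np.array([[1.0, 0.0, 5.0],
              [1.2, 0.0, 5.0],
              [0.0, 1.0, 5.0],
              [0.0, 1.2, 5.0]])
y = np.array([0, 0, 1, 1])
w = lda_direction(X, y)
proj = X @ w
print(round(float(proj[:2].mean()), 2), round(float(proj[2:].mean()), 2))  # -0.78 0.78
```

The two classes project to well-separated means along w even though Sw has no inverse; GLDA optimizes Fisher's criterion directly rather than relying on the pseudoinverse heuristic.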