Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions
- Bioinformatics, 2003
Cited by 38 (1 self)
Motivation: Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the ‘curse of dimensionality’: the number of features characterizing these data is in the thousands or tens of thousands. The other is the ‘curse of dataset sparsity’: the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease. Results: Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample per feature ratio is increased to the recommended 5–10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several ‘optimal’ feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these ‘optimal’ feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.
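The non-uniqueness described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure; it assumes pure Gaussian noise, 3 samples per class, and 2000 features (all hypothetical choices), and counts how many single features separate the two classes perfectly by a threshold. For 3 versus 3 samples, a noise feature separates the classes with probability 2 * 3! * 3! / 6! = 0.1, so many spurious "perfect" features are expected.

```python
import random


def perfectly_separating_features(n_per_class=3, n_features=2000, seed=0):
    """Return indices of pure-noise features that split two classes perfectly.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    hits = []
    for feature in range(n_features):
        # Both classes are drawn from the same distribution: no real signal.
        class_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        class_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        # A single threshold separates the classes if their ranges do not overlap.
        if max(class_a) < min(class_b) or max(class_b) < min(class_a):
            hits.append(feature)
    return hits


if __name__ == "__main__":
    hits = perfectly_separating_features()
    print(f"{len(hits)} of 2000 noise features classify all 6 samples perfectly")
```

Every feature counted here is biologically meaningless by construction, which is why a perfect result on a sparse dataset is weak evidence on its own.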
Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data
- 2005
Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue
Cited by 9 (5 self)
Abstract: It is a consensus in microarray analysis that identifying potential local patterns, characterized by coherent groups of genes and conditions, may shed light on the discovery of previously undetectable biological cellular processes of genes, as well as macroscopic phenotypes of related samples. In order to simultaneously cluster genes and conditions, we have previously developed a fast coclustering algorithm, Minimum Sum-Squared Residue Coclustering (MSSRCC), which employs an alternating minimization scheme and generates what we call coclusters in a “checkerboard” structure. In this paper, we propose specific strategies that enable MSSRCC to escape poor local minima and resolve the degeneracy problem in partitional clustering algorithms. The strategies include binormalization, deterministic spectral initialization, and incremental local search. We assess the effects of various strategies on both synthetic gene expression data sets and real human cancer microarrays and provide empirical evidence that MSSRCC with the proposed strategies performs better than existing coclustering and clustering algorithms. In particular, the combination of all three strategies leads to the best performance. Furthermore, we illustrate coherence of the resulting coclusters in a checkerboard structure, where genes in a cocluster manifest the phenotype structure of corresponding specific samples and evaluate the enrichment of functional annotations in Gene Ontology (GO). Index Terms: Microarray analysis, coclustering, binormalization, deterministic spectral initialization, local search, gene ontology.
The Generalized LASSO: a wrapper approach to gene selection for microarray data
- Proceedings 14th International Conference on Automated Deduction (CADE-14), 252--255, 2002
Cited by 7 (0 self)
We report on the successful application of the Generalized LASSO method to feature selection problems for microarray data. This method implements a wrapper strategy for selecting relevant genes by optimizing the discriminative power of a logistic classification model. The selection process can be interpreted as a special instance of the Bayesian automatic relevance determination (ARD) principle.
Gene Expression Profile Classification: A Review
- Current Bioinformatics, 2006
Cited by 5 (0 self)
Abstract: In this review, we have discussed the class-prediction and discovery methods that are applied to gene expression data, along with the implications of the findings. We attempted to present a unified approach that considers both class-prediction and class-discovery. We devoted a substantial part of this review to an overview of pattern classification/recognition methods and discussed important issues such as preprocessing of gene expression data, curse of dimensionality, feature extraction/selection, and measuring or estimating classifier performance. We discussed and summarized important properties such as generalizability (sensitivity to overtraining), built-in feature selection, ability to report prediction strength, and transparency (ease of understanding of the operation) of different class-predictor design approaches to provide a quick and concise reference. We have also covered in detail the topic of biclustering, an emerging clustering method that processes the entries of the gene expression data matrix in both gene and sample directions simultaneously.
Robust and accurate cancer classification with gene expression profiling
- Proc. 4th IEEE Comput. Syst. Bioinf. Conf., 2005
Cited by 3 (0 self)
Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples to features per class ratio. As a result, it can be used to classify new samples robustly with low and trustable (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix Sw be nonsingular. Unfortunately, Sw is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher’s criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when Sw is nonsingular. Different from the conventional LDA, GLDA does not assume the nonsingularity of Sw, and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm of GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. Especially on some difficult instances that have very small samples to genes per class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines, random forests, etc.
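The singularity of Sw mentioned in this abstract follows from a rank bound: with n samples in c classes, Sw is a sum of outer products of n class-centered vectors that satisfy c linear constraints, so its rank is at most n - c. The sketch below only checks that bound on a toy dataset; it is not GLDA, and the data and function names are made up for illustration. Exact fractions are used so the rank is not affected by rounding.

```python
from fractions import Fraction


def within_class_scatter(samples, labels):
    """Within-class scatter matrix Sw as a list of lists of Fractions."""
    p = len(samples[0])
    sw = [[Fraction(0)] * p for _ in range(p)]
    for cls in set(labels):
        members = [x for x, y in zip(samples, labels) if y == cls]
        mean = [sum(Fraction(x[k]) for x in members) / len(members) for k in range(p)]
        for x in members:
            d = [Fraction(x[k]) - mean[k] for k in range(p)]
            # Accumulate the outer product of the centered sample with itself.
            for a in range(p):
                for b in range(p):
                    sw[a][b] += d[a] * d[b]
    return sw


def matrix_rank(matrix):
    """Exact rank by Gaussian elimination over Fractions."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        pivot = next((i for i in range(rank, n_rows) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, n_rows):
            factor = m[i][col] / m[rank][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


# Hypothetical toy data: 4 samples, 2 classes, 6 "genes" (n < p).
samples = [[1, 0, 2, 5, 3, 1], [2, 1, 0, 4, 1, 7],
           [0, 3, 1, 2, 6, 2], [4, 1, 3, 0, 2, 5]]
labels = [0, 0, 1, 1]
sw = within_class_scatter(samples, labels)
sw_rank = matrix_rank(sw)  # bounded by n - c = 2, far below p = 6
```

Because the rank cannot reach the number of features when samples are scarce, Sw has no inverse, which is the obstacle the abstract says conventional LDA runs into.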
Effect of data transformation on residue
- 2007
Cited by 2 (2 self)
Recently, Aguilar-Ruiz [2005] considered a data matrix containing both scaling and shifting factors and showed that the mean squared residue [Cheng and Church, 2000], called RESIDUE(II) in this paper, is useful for discovering shifting patterns but not appropriate for finding scaling patterns. This finding draws our attention to the weakness of the RESIDUE(II) measure and the need for new approaches to discover both scaling and shifting patterns in the considered matrix. To resolve the weakness of RESIDUE(II) in finding scaling patterns, we propose a simple remedy that still uses the same residue measure. The main idea is to remove hidden scaling factors in the considered data matrix by applying a specific data transformation. We investigate various data transformations including no transformation, double centering, mean centering, standard deviation normalization, and Z-score transformation. Further, we apply these data transformations to the row/column dimension of data matrix models with different global/local scaling and global/local shifting factors. First, we characterize the properties of the data transformations on different data matrix models, including six Euclidean co-clustering schemes in Bregman co-clustering algorithms [Banerjee et al., 2007] and other existing data models in the literature. In particular, we formally analyze the effect of each data transformation on the two residues [Cho et al., 2004], here called RESIDUE(I) and RESIDUE(II), respectively. Then, we apply all the data transformations to publicly available human cancer gene expression datasets and empirically validate the analysis results by using the minimum sum squared residue co-clustering (MSSRCC) algorithms [Cho et al., 2004]. In conclusion, through column standard deviation normalization or column Z-score transformation, we are able to overcome the shortcoming of RESIDUE(II) in finding scaling patterns and discover both scaling and shifting patterns.
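The behaviour of the mean squared residue of Cheng and Church can be checked on a toy matrix. The residue of entry (i, j) is a_ij - (row mean i) - (column mean j) + (overall mean), and the score is the mean of the squared residues. The sketch below treats the whole matrix as a single bicluster; the matrices, the shift and scale factors, and the function names are hypothetical, and the Z-score step is only a toy check in the spirit of the abstract, not the paper's analysis.

```python
import math


def mean_squared_residue(matrix):
    """Cheng-Church mean squared residue of a matrix taken as one bicluster."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


def zscore_columns(matrix):
    """Z-score each column (population standard deviation)."""
    n_rows = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n_rows for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n_rows)
            for col, m in zip(columns, means)]
    return [[(row[j] - means[j]) / stds[j] for j in range(len(row))]
            for row in matrix]


base = [1.0, 2.0, 4.0]
# Shifting pattern: every row is the base profile plus a row-specific offset.
shifting = [[x + s for x in base] for s in (0.0, 3.0, 5.0)]
# Scaling pattern: every row is the base profile times a row-specific factor.
scaling = [[x * s for x in base] for s in (1.0, 2.0, 3.0)]

msr_shifting = mean_squared_residue(shifting)   # zero up to rounding
msr_scaling = mean_squared_residue(scaling)     # 28/27, clearly nonzero
msr_scaling_z = mean_squared_residue(zscore_columns(scaling))  # zero up to rounding
```

For the scaling matrix the residue factorizes as (s_i - mean of s) * (b_j - mean of b), which is why the score is nonzero even though the rows are perfectly coherent; after the column Z-score all columns of this toy matrix coincide and the score drops to zero.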
Robust overlapping co-clustering
- Dept. of ECE, Univ. of Texas at Austin, IDEAL-TR09, Downloadable from http://www.lans.ece.utexas.edu/papers/techreports/deodhar08ROCC.pdf, 2008
Cited by 2 (2 self)
Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. On such datasets, in order to accurately identify meaningful clusters, both non-informative data points and non-discriminative features need to be discarded. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional “one-sided” clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently detecting dense, arbitrarily positioned, possibly overlapping co-clusters in a dataset. ROCC works with a large variety of distance measures and different co-cluster definitions, making it applicable to a wide range of real life datasets. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data as compared to several other prominent approaches proposed for this task. We also point out other interesting applications of the proposed framework in solving challenging
Optimality Driven Nearest Centroid Classification from Genomic Data
Cited by 2 (0 self)
Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.
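For readers unfamiliar with the baseline this abstract builds on, a plain nearest-centroid classifier can be sketched as follows. This is the generic technique with ordinary class means and an optional, manually supplied feature subset; it is not the paper's greedy selection algorithm or its shrinkage estimator, and the data, labels, and function names are hypothetical.

```python
import math


def fit_centroids(samples, labels):
    """Return a dict mapping each class label to its mean feature vector."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x, features=None):
    """Assign x to the class whose centroid is closest in Euclidean distance.

    `features` optionally restricts the distance to a subset of feature indices.
    """
    idx = list(range(len(x))) if features is None else list(features)

    def distance(centroid):
        return math.sqrt(sum((x[k] - centroid[k]) ** 2 for k in idx))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical expression values: feature 0 is informative, the others are not.
train = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.9], [3.0, 5.1, 0.1], [3.2, 4.9, 0.8]]
train_labels = ["normal", "normal", "tumor", "tumor"]
centroids = fit_centroids(train, train_labels)

pred_all = predict(centroids, [2.9, 5.0, 0.5])
pred_subset = predict(centroids, [1.3, 9.0, 9.0], features=[0])
```

Restricting the distance to selected features is where a feature selection method, such as the one proposed in the paper, would plug in.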

