Hierarchical Classification using Shrunken Centroids
, 2005
Cited by 3 (1 self)
There are various types of classifiers that can be trained on gene expression data with class labels. Many of them have an embedded mechanism for feature selection, by which they distinguish a subset of significant genes that are used for future prediction. When dealing with more than two class labels, especially when the number goes up to a dozen or more, people find it useful to know the relative affinity among the classes and the different subsets of genes involved in discriminating different groups of classes. It provides them with more information not only when analyzing the relationship among classes, but also when predicting on future instances. We have achieved this by developing a hierarchical adaptation of the nearest shrunken centroid classifier. Here, we demonstrate our new method using a cancer data example.
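As a rough illustration of the flat (non-hierarchical) nearest shrunken centroid idea that this abstract builds on, the sketch below shrinks each class centroid toward the overall centroid by soft thresholding and classifies a sample by its nearest shrunken centroid. It is a simplified sketch, not the authors' method: the function names, the `delta` threshold, and the toy data are assumptions, and the per-gene standardization used in the full method is omitted.

```python
from collections import defaultdict


def soft_threshold(value, delta):
    """Move value toward zero by delta; anything within delta of zero becomes 0."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def shrunken_centroids(samples, labels, delta):
    """Return (overall centroid, {class label: shrunken class centroid}).

    Simplification (assumption): raw centroid offsets are shrunk directly,
    without dividing by a per-gene standard deviation.
    """
    n_genes = len(samples[0])
    overall = [sum(s[g] for s in samples) / len(samples) for g in range(n_genes)]
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    centroids = {}
    for label, rows in by_class.items():
        mean = [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
        centroids[label] = [
            overall[g] + soft_threshold(mean[g] - overall[g], delta)
            for g in range(n_genes)
        ]
    return overall, centroids


def selected_genes(overall, centroids):
    """Genes whose shrunken centroid still differs from the overall centroid."""
    return [
        g for g in range(len(overall))
        if any(c[g] != overall[g] for c in centroids.values())
    ]


def predict(sample, centroids):
    """Assign the class whose shrunken centroid is closest in squared distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy data (hypothetical): gene 0 separates the classes, gene 1 does not.
samples = [[5.0, 1.0], [5.2, 1.1], [1.0, 1.0], [0.8, 0.9]]
labels = ["A", "A", "B", "B"]
overall, centroids = shrunken_centroids(samples, labels, delta=0.5)
```

Genes whose offsets are thresholded to zero for every class drop out of the distance comparison, which is how the embedded feature selection mentioned in the abstract arises in this kind of classifier.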
Robust and accurate cancer classification with gene expression profiling
- in Proc. 4th IEEE Comput. Syst. Bioinf. Conf
, 2005
Cited by 3 (0 self)
Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples-to-features-per-class ratio. As a result, it can be used to classify new samples robustly with low and trustworthy (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix Sw be nonsingular. Unfortunately, Sw is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher’s criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when Sw is nonsingular. Unlike the conventional LDA, GLDA does not assume the nonsingularity of Sw, and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm of GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. Especially on some difficult instances that have very small samples-to-genes-per-class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines, random forests, etc.
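For orientation, one standard textbook way to write the scatter matrices and Fisher's criterion is sketched below, together with the rank bound that makes Sw singular when genes outnumber samples. The notation (samples x_i, class index sets C_c with n_c members, class means \mu_c, overall mean \mu, n samples, k classes, d genes, projection W) is assumed here and is not taken from the paper; the abstract does not state which form of the criterion GLDA optimizes.

```latex
S_w = \sum_{c=1}^{k} \sum_{i \in C_c} (x_i - \mu_c)(x_i - \mu_c)^{\top},
\qquad
S_b = \sum_{c=1}^{k} n_c \, (\mu_c - \mu)(\mu_c - \mu)^{\top}

J(W) = \operatorname{tr}\!\left( \left( W^{\top} S_w W \right)^{-1} W^{\top} S_b W \right)

\operatorname{rank}(S_w) \le n - k
\quad\Longrightarrow\quad
S_w \in \mathbb{R}^{d \times d} \text{ is singular whenever } d > n - k
```

With thousands of genes and only tens of samples, d is far larger than n - k, so the inverse in J(W) does not exist for the unprojected Sw.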
Biologically-Interpretable Disease Classification Based on Gene Expression Data
- Virginia Polytechnic Institute and State University
, 2005
Cited by 2 (1 self)
Classification of tissues and diseases based on gene expression data is a powerful application of DNA microarrays. Many popular classifiers like support vector machines, nearest-neighbour methods, and boosting have been applied successfully to this problem. However, it is difficult to determine from these classifiers which genes are responsible for the distinctions between the diseases. We propose a novel framework for classification of gene expression data based on the notion of condition-specific clusters of co-expressed genes called xMotifs. Our xMotif-based classifier is biologically interpretable: we show how we can detect relationships between xMotifs and gene functional annotations. Our classifier achieves high accuracy in leave-one-out cross-validation on both two-class and multi-class data. Our technique has the potential to be the method of choice for researchers interested in disease and tissue classification.
Acknowledgments: I would first like to acknowledge and thank my advisor, T. M. Murali, without whom this thesis would not be possible. I would also like to thank Jonathan Myers for assisting me with some of the work on functional enrichment and writing the original version of libEnrichment, which xMotif uses to assess biological significance.
Algorithms for Feature Selection in Rank-Order Spaces
, 2005
Cited by 2 (1 self)
The problem of feature selection in supervised learning situations is considered, where all features are drawn from a common domain and are best interpreted via ordinal comparisons with other features, rather than as numerical values. In particular, each instance is a member of a space of ranked features. This problem is pertinent in electoral, financial, and bioinformatics contexts, where features denote assessments in terms of counts, ratings, or rankings. Four algorithms for feature selection in such rank-order spaces are presented; two are information-theoretic, and two are order-theoretic. These algorithms are empirically evaluated against both synthetic and real world datasets. The main results of this paper are (i) characterization of relationships and equivalences between different feature selection strategies with respect to the spaces in which they operate, and the distributions they seek to approximate; (ii) identification of computationally simple and efficient strategies that perform surprisingly well; and (iii) a feasibility study of order-theoretic feature selection for large scale datasets.
Recursive Partitioning and Tree-based Methods
Cited by 1 (0 self)
Tree-based methods have become one of the most flexible, intuitive, and powerful data analytic tools for exploring complex data structures. The applications
Toxicogenomics | Article Using Decision Forest to Classify Prostate Cancer Samples on the Basis of SELDI-TOF MS Data: Assessing Chance Correlation and Prediction Confidence
Class prediction using “omics” data is playing an increasing role in toxicogenomics, diagnosis/prognosis, and risk assessment. These data are usually noisy and represented by relatively few samples and a very large number of predictor variables (e.g., genes of DNA microarray data or m/z peaks of mass spectrometry data). These characteristics manifest the importance of assessing potential random correlation and overfitting of noise for a classification model based on omics data. We present a novel classification method, decision forest (DF), for class prediction using omics data. DF combines the results of multiple heterogeneous but comparable decision tree (DT) models to produce a consensus prediction. The method is less prone to overfitting of noise and chance correlation. A DF model was developed to predict presence of prostate cancer using a proteomic data set generated from surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS). The degree of chance correlation and prediction confidence of the model were rigorously assessed by extensive cross-validation and randomization testing. Comparison of model prediction with imposed random correlation demonstrated biologic relevance of the model and the reduction of overfitting in DF. Furthermore, two confidence levels (high and low confidences) were assigned to each prediction,
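The two ideas described in this abstract, a consensus prediction over several models and a randomization test for chance correlation, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the DF implementation: averaging probabilities as the combination rule, the `high_margin` cutoff for the high/low confidence split, and the `fit_and_score` callback are all hypothetical, since the abstract does not specify them.

```python
import random


def consensus_probability(models, sample):
    """Average the positive-class probabilities returned by each model.

    Assumption: each model is a callable mapping a sample to a probability;
    plain averaging stands in for whatever combination rule DF actually uses.
    """
    probabilities = [model(sample) for model in models]
    return sum(probabilities) / len(probabilities)


def classify_with_confidence(probability, high_margin=0.2):
    """Return (label, confidence); high_margin is a hypothetical cutoff."""
    label = 1 if probability >= 0.5 else 0
    confidence = "high" if abs(probability - 0.5) >= high_margin else "low"
    return label, confidence


def chance_correlation_p_value(fit_and_score, samples, labels,
                               n_permutations=200, seed=0):
    """Randomization test: how often do shuffled labels score as well as real ones?

    fit_and_score(samples, labels) is a hypothetical callback that trains a
    model and returns its (for example, cross-validated) accuracy.
    """
    rng = random.Random(seed)
    observed = fit_and_score(samples, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if fit_and_score(samples, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_good + 1) / (n_permutations + 1)
```

A small p-value from the randomization test suggests the observed accuracy is unlikely to arise from chance correlation alone, which is the kind of check the abstract describes.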
Random Forest for Bioinformatics
Modern biology has experienced an increasing use of machine learning techniques for large scale and complex biological data analysis. In the area of Bioinformatics, the Random Forest (RF) [6] technique, which includes an ensemble of decision trees and incorporates feature selection and interactions naturally in the learning

