Minimum Redundancy Feature Selection from Microarray Gene Expression Data
J Bioinform Comput Biol, 2003
Cited by 61 (6 self)
Motivation. Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expression among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. Results. We propose a minimum redundancy – maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 5 gene expression data sets: NCI,
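The abstract above names the idea but not the exact selection criterion, so the following is only a minimal sketch under stated assumptions: features are already discretized, relevance and redundancy are both measured by mutual information, and the greedy score is relevance minus mean redundancy. The function names and the toy data (`g1`, `g2`, `g3`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

def mrmr_select(features, labels, k):
    """Greedily pick up to k feature names.

    Assumed score: I(feature; labels) minus the mean mutual information
    between the candidate and the features selected so far.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Hypothetical toy data: four classes, four samples each.
labels = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
features = {
    "g1": [0] * 8 + [1] * 8,               # highly relevant
    "g2": [0] * 8 + [1] * 8,               # exact copy of g1 (redundant)
    "g3": [0, 0, 0, 1, 1, 1, 1, 0] * 2,    # weaker, but independent of g1
}
print(mrmr_select(features, labels, 2))
```

Ranking by relevance alone would pick `g1` and its copy `g2`; with the redundancy penalty the second pick on this toy data is `g3`, which is the kind of broader coverage the abstract describes.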
Adaptive Dimension Reduction for Clustering High Dimensional Data
2002
Cited by 45 (2 self)
It is well known that for high dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minima. Many initialization methods have been proposed to tackle this problem, but with only limited success. In this paper we propose a new approach to resolve this problem by repeated dimension reductions such that K-means or EM are performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced dimensional subspace and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrates the effectiveness of the proposed algorithm.
On prediction using variable order Markov models
Journal of Artificial Intelligence Research, 2004
Cited by 42 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
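To make the evaluation measure concrete, here is a minimal sketch of average log-loss for a sequence predictor. It uses a plain fixed-order Markov model with add-one smoothing as a simplified stand-in; it is not CTW, PPM, a PST, or the Lempel-Ziv variant discussed above, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_markov(seq, order):
    """Count next-symbol occurrences for each context of up to `order` symbols."""
    counts = defaultdict(Counter)
    for i in range(len(seq)):
        context = seq[max(0, i - order):i]
        counts[context][seq[i]] += 1
    return counts

def predict(counts, context, symbol, alphabet):
    """Add-one smoothed estimate of P(symbol | context)."""
    c = counts.get(context, Counter())
    return (c[symbol] + 1) / (sum(c.values()) + len(alphabet))

def average_log_loss(counts, seq, order, alphabet):
    """Mean of -log2 P(x_i | preceding context) over the sequence, in bits per symbol."""
    total = 0.0
    for i in range(len(seq)):
        context = seq[max(0, i - order):i]
        total -= math.log2(predict(counts, context, seq[i], alphabet))
    return total / len(seq)

# Hypothetical toy data over a two-letter alphabet.
model = train_markov("abababababab", order=1)
print(average_log_loss(model, "abab", 1, "ab"))  # pattern seen in training
print(average_log_loss(model, "aabb", 1, "ab"))  # pattern not seen in training
```

A lower value means the model assigned higher probability to the symbols that actually occurred; a model with no training data scores log2 of the alphabet size, here exactly 1 bit per symbol.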
Prediction of MHC class I binding peptides, using SVMHC
2002
Cited by 17 (1 self)
Background: T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. MHC-peptide complexes are potential tools for diagnosis and treatment of pathogens and cancer, as well as for the development of peptide vaccines. Only one in 100 to 200 potential binders actually binds to a certain MHC molecule; therefore, a good prediction method for MHC class I binding peptides can reduce the number of candidate binders that need to be synthesized and tested.
Kernel-based machine learning protocol for predicting DNA-binding proteins
Nucleic Acids Res, 2005
Multi-Class Protein Fold Classification Using a New Ensemble Machine Learning Approach
2003
Cited by 13 (0 self)
Protein structure classification represents an important process in understanding the associations between sequence and structure as well as possible functional and evolutionary relationships.
Multi-class protein fold recognition using adaptive codes
Proceedings of the 22nd International Conference on Machine Learning, 2005
Cited by 11 (3 self)
We develop a novel multi-class classification method based on output codes for the problem of classifying a sequence of amino acids into one of many known protein structural classes, called folds. Our method learns relative weights between one-vs-all classifiers and encodes information about the protein structural hierarchy for multi-class prediction. Our code weighting approach significantly improves on the standard one-vs-all method for the fold recognition problem. In order to compare against widely used methods in protein sequence analysis, we also test nearest neighbor approaches based on the PSI-BLAST algorithm. Our code weight learning algorithm strongly outperforms these PSI-BLAST methods on every structure recognition problem we consider.
Analysis of Gene Expression Profiles: Class Discovery and Leaf Ordering
In Proc. 6th Int'l Conf. Research in Comp. Mol. Bio. (RECOMB 2002), 2002
Cited by 10 (4 self)
We approach the class discovery and leaf ordering problems using spectral graph partitioning methodologies. For class discovery or clustering, we present a min-max cut hierarchical clustering method and show it produces subtypes quite close to human expert labeling on the lymphoma dataset with 6 classes. On optimal leaf ordering for displaying the gene expression data, we present a sequential ordering method that can be computed in O(n^2) time which also preserves the cluster structure. We also show that well-known statistical methods such as the F-statistic test and principal component analysis are very useful in gene expression analysis.
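As an illustration of the F-statistic mentioned above, the following is a minimal sketch of ranking genes by a one-way ANOVA F-statistic across classes. How the paper actually applies the test is not stated in the abstract, so the ranking step, the function name, and the gene names and values are all hypothetical.

```python
def f_statistic(values, labels):
    """One-way ANOVA F: between-class variance over within-class variance.

    Assumes at least two classes and non-zero within-class variance.
    """
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    between = sum(len(vs) * (means[g] - grand_mean) ** 2 for g, vs in groups.items())
    within = sum((v - means[g]) ** 2 for g, vs in groups.items() for v in vs)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical expression values for two genes over six samples in two classes.
classes = [0, 0, 0, 1, 1, 1]
genes = {
    "geneA": [1, 2, 3, 4, 5, 6],  # class means well separated
    "geneB": [1, 5, 2, 6, 3, 4],  # class means overlap heavily
}
ranking = sorted(genes, key=lambda g: f_statistic(genes[g], classes), reverse=True)
print(ranking)
```

Genes whose class means differ strongly relative to their within-class spread receive a larger F and sort to the front of the ranking.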
Protein function classification via support vector machine approach
Mathematical Biosciences, 2003
Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein
Sel, 2004
Cited by 9 (1 self)
Membrane proteins are generally classified into the following five types: (1) type I membrane protein, (2) type II membrane protein, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins, and (5) GPI-anchored membrane proteins. Prediction of membrane protein types has become one of the growing hot topics in bioinformatics. Currently, we are facing two critical challenges in this area. One is how to take into account the extremely complicated sequence-order effects; the other is how to deal with the highly uneven sizes of the subsets in a training dataset. In this paper, stimulated by the concept of using the pseudo-amino-acid composition (Chou, K.C.: PROTEINS: Structure, Function, and Genetics, 43: 246-255, 2001; ibid. 2001, 44, 60) to incorporate the sequence-order effects, the spectral analysis technique is introduced to represent the statistical sample of a protein. Based on such a framework, the weighted support vector machine (SVM) algorithm is applied. The new approach has remarkable power in dealing with the bias caused by the situation when one subset in the training dataset

