Learning Gene Functional Classifications From Multiple Data Types
- JOURNAL OF COMPUTATIONAL BIOLOGY, 2002
Cited by 48 (1 self)
In our attempts to understand cellular function at the molecular level, we must be able to synthesize information from disparate types of genomic data. We consider the problem of inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons. We demonstrate the application of the support vector machine (SVM) learning algorithm to this functional inference task. Our results suggest the importance of exploiting prior information about the heterogeneity of the data. In particular, we propose an SVM kernel function that is explicitly heterogeneous. In addition, we describe feature scaling methods for further exploiting prior knowledge of heterogeneity by giving each data type different weights.
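The idea of an explicitly heterogeneous kernel with per-data-type weights can be illustrated with a minimal sketch. The abstract does not give the kernel's form, so the weighted sum of per-type polynomial kernels below is an assumption, as are the function names, the weights, and the toy vectors. The only fact relied on is that a non-negative weighted sum of valid kernels is itself a valid kernel.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, degree=3):
    """Polynomial kernel (u.v + 1)^degree computed within one data type."""
    return (dot(u, v) + 1.0) ** degree


def heterogeneous_kernel(x, y, w_expr=1.0, w_phylo=1.0, degree=3):
    """Weighted sum of kernels computed separately on each data type.

    x and y are (expression_vector, phylogenetic_profile) pairs.
    The weights w_expr and w_phylo are hypothetical scaling knobs and
    must be non-negative for the sum to remain a valid kernel.
    """
    x_expr, x_phylo = x
    y_expr, y_phylo = y
    return (w_expr * poly_kernel(x_expr, y_expr, degree)
            + w_phylo * poly_kernel(x_phylo, y_phylo, degree))


# Toy genes (made-up values): two expression measurements, three profile bits.
gene_a = ([0.5, -1.0], [1, 0, 1])
gene_b = ([1.0, 0.0], [1, 1, 0])
k_ab = heterogeneous_kernel(gene_a, gene_b)
```

Because each kernel sees only its own data type, no polynomial cross terms mix expression features with phylogenetic features, which is one way of encoding prior knowledge that the two sources are distinct.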
Integrated Probabilistic Model for Functional Prediction of
- J Comput Biol, 2004
Cited by 39 (0 self)
We develop an integrated probabilistic model that combines protein physical interactions, genetic interactions, highly correlated gene expression networks, protein complex data, and domain structures of individual proteins to predict protein functions. The model is an extension of our previous model for protein function prediction based on Markovian random field theory. The model is flexible in that other protein pairwise relationship information and features of individual proteins can be easily incorporated. Two features distinguish the integrated approach from other available methods for protein function prediction. One is that the integrated approach uses all available sources of information with different weights for different sources of data. It is a global approach taking the whole network into consideration. The other is that a posterior probability for the protein to have the function of interest is assigned. The posterior probability indicates how confident we are about assigning the function to the protein. We apply our integrated approach to predict functions of yeast proteins based on MIPS protein function classifications and the interaction networks based on MIPS physical and genetic interactions, gene expression profiles, Tandem Affinity Purification (TAP) protein complex data, and protein domain information. We study the sensitivity and specificity of the integrated approach using different sources of information by the leave-one-out approach. Compared to using MIPS physical interactions only, the integrated approach combining all the information increases the sensitivity from 57% to 87% when the specificity is set at 57%, an increase of 30 percentage points. It should also be noted that by enlarging the interaction network, the number of proteins whose functions can be...
Machine Learning of Functional Class From Phenotype Data
- Bioinformatics, 2002
Cited by 12 (2 self)
Motivation: Mutant phenotype growth experiments are an important novel source of functional genomics data which have received little attention in bioinformatics. We applied supervised machine learning to the problem of using phenotype data to predict the functional class of ORFs in S. cerevisiae. Three sources of data were used: TRIPLES, EUROFAN and MIPS. The analysis of the data presented a number of challenges to machine learning: multi-class labels, a large number of sparsely populated classes, the need to learn a set of accurate rules (not a complete classification), and a very large number of missing values. We modified the algorithm C4.5 to deal with these problems.
Nonstationary kernel combination
- In 23rd International Conference on Machine Learning (ICML), 2006
Cited by 12 (3 self)
The power and popularity of kernel methods stem in part from their ability to handle diverse forms of structured inputs, including vectors, graphs and strings. Recently, several methods have been proposed for combining kernels from heterogeneous data sources. However, all of these methods produce stationary combinations; i.e., the relative weights of the various kernels do not vary among input examples. This article proposes a method for combining multiple kernels in a nonstationary fashion. The approach uses a large-margin latent-variable generative model within the maximum entropy discrimination (MED) framework. Latent parameter estimation is rendered tractable by variational bounds and an iterative optimization procedure. The classifier we use is a log-ratio of Gaussian mixtures, in which each component is implicitly mapped via a Mercer kernel function. We show that the support vector machine is a special case of this model. In this approach, discriminative parameter estimation is feasible via a fast sequential minimal optimization algorithm. Empirical results are presented on synthetic data, several benchmarks, and on a protein function annotation task.
Integrating genomic data to predict transcription factor binding
- Genome Inform. Ser. Workshop Genome Inform, 2005
Cited by 8 (1 self)
Transcription factor binding sites (TFBS) in gene promoter regions are often predicted by using position specific scoring matrices (PSSMs), which summarize sequence patterns of experimentally determined TF binding sites. Although PSSMs are more reliable than simple consensus string matching in predicting a true binding site, they generally result in high numbers of false positive hits. This study attempts to reduce the number of false positive matches and generate new predictions by integrating various types of genomic data by two methods: a Bayesian allocation procedure, and support vector machine classification. Several methods will be explored to strengthen the prediction of a true TFBS in the Saccharomyces cerevisiae genome: binding site degeneracy, binding site conservation, phylogenetic profiling, TF binding site clustering, gene expression profiles, GO functional annotation, and k-mer counts in promoter regions. Binding site degeneracy (or redundancy) refers to the number of times a particular transcription factor’s binding motif is discovered in the upstream region of a gene. Phylogenetic conservation takes into account the number of orthologous upstream regions in other genomes that contain a particular binding site. Phylogenetic profiling refers to the presence or
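As a rough illustration of how a PSSM summarizes known binding sites and is then used to scan a promoter, here is a minimal sketch. The example sites, the pseudocount, the uniform background frequency of 0.25, and all function names are assumptions for illustration only; they are not taken from the study.

```python
from math import log2


def build_pssm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from aligned, equal-length binding sites.

    A uniform background frequency is assumed here for simplicity.
    """
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(col[base] for col, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical aligned sites and a made-up promoter fragment.
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
pssm = build_pssm(sites)
score, position = best_hit(pssm, "AATGACTCGG")
```

Any fixed score threshold applied to `score_window` will also accept chance matches, which is the false-positive problem that the additional genomic evidence described above is meant to reduce.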
Gene function classification using Bayesian models with hierarchy-based priors
- BMC Bioinformatics, 2006
Cited by 7 (2 self)
We investigate the application of hierarchical classification schemes to the annotation of gene function based on several characteristics of protein sequences including phylogenic descriptors, sequence based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and a MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. The results from all three models show substantial improvement over previous methods, which were based on the C5 algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining these sources of information, our approach results in a higher accuracy rate when compared to models that use each data source alone. Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.
Protein functional class prediction with a combined graph
- Proceedings of the Korean Data Mining Conference, 2004
Cited by 6 (0 self)
In bioinformatics, there exist multiple descriptions of graphs for the same set of genes or proteins. For instance, in yeast systems, graph edges can represent different relationships such as protein-protein interactions, genetic interactions, or co-participation in a protein complex. Relying on similarities between nodes, each graph can be used independently for prediction of protein function. However, since different graphs contain partly independent and partly complementary information about the problem at hand, one can enhance the total information extracted by combining all graphs. In this paper, we propose a method for integrating multiple graphs within a framework of semi-supervised learning. The method alternates between minimizing the objective function with respect to network output and with respect to combining weights. We apply the method to the task of protein functional class prediction in yeast. The proposed method performs significantly better than the same algorithm trained on any single graph.
Gene Functional Classification by Semi-supervised Learning from Heterogeneous Data
- IN PROC. ACM SYMPOSIUM ON APPLIED COMPUTING (SAC), 2003
Cited by 5 (0 self)
Gene function discovery is an important and interesting problem in computational analysis of microarray data. In this paper, we investigate the use of a semi-supervised learning algorithm for inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons. The semi-supervised learning approach aims at minimizing the disagreement between individual models built from each separate information source by employing a co-updating method and making use of both labeled and unlabeled data. Our results suggest that the semi-supervised approach could be used for gene functional classification. The data sets and the program code used for the experiments can be accessed from our webpage.
A Support Vector Machine Approach to the Identification of Phosphorylation
Cited by 5 (1 self)
We describe a bioinformatics tool that can be used to predict the position of phosphorylation sites in proteins based only on sequence information. The method uses the support vector machine (SVM) statistical learning theory. The statistical models for phosphorylation by various types of kinases are built using a dataset of short (9-amino acid long) sequence fragments. The sequence segments are dissected around post-translationally modified sites of proteins that are in the current release of the Swiss-Prot database and that were experimentally confirmed to be phosphorylated by any kinase. We represent them as vectors in a multidimensional abstract space of short sequence fragments. The prediction method is as follows. First, a given query protein sequence is dissected into overlapping short segments. All the fragments are then projected into the multidimensional space of sequence fragments via a collection of different representations. Those points are classified with pre-built statistical models (the SVM method with linear, polynomial and radial kernel functions) either as phosphorylated or inactive
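The first step of the prediction method, dissecting a query protein into overlapping 9-residue segments, can be sketched as follows. Keeping only windows centered on serine, threonine, or tyrosine is an assumption made here for illustration, as are the function names and the toy sequence; the abstract states only that the segments are short and overlapping.

```python
def overlapping_segments(sequence, width=9):
    """Yield (center_index, segment) for every full-width window."""
    half = width // 2
    for start in range(len(sequence) - width + 1):
        yield start + half, sequence[start:start + width]


def candidate_sites(sequence, width=9, residues="STY"):
    """Keep windows whose central residue could be phosphorylated.

    Filtering on S, T and Y is a hypothetical choice, not a detail
    confirmed by the abstract.
    """
    return [
        (center, segment)
        for center, segment in overlapping_segments(sequence, width)
        if segment[width // 2] in residues
    ]


# Made-up 13-residue sequence used only to exercise the functions.
protein = "MKRSPTLLAYGSD"
segments = list(overlapping_segments(protein))
sites = candidate_sites(protein)
```

Each retained segment would then be encoded as a feature vector and passed to the pre-built classifiers; that encoding step is not shown because the abstract does not specify the representations used.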
Machine Learning in Low-level Microarray Analysis
- SIGKDD Explorations, 2003
Cited by 1 (0 self)
Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis -- often associated with the pre-processing stage within the microarray life-cycle -- has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.

