A Hidden Markov Model Variant for Sequence Classification
- PROCEEDINGS OF THE TWENTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE
Cited by 8 (0 self)
Sequence classification is central to many practical problems within machine learning. Distance metrics between arbitrary pairs of sequences can be hard to define because sequences can vary in length and the information contained in the order of sequence elements is lost when standard metrics such as Euclidean distance are applied. We present a scheme that employs a Hidden Markov Model variant to produce a set of fixed-length description vectors from a set of sequences. We then define three inference algorithms, a Baum-Welch variant, a Gibbs Sampling algorithm, and a variational algorithm, to infer model parameters. Finally, we show experimentally that the fixed-length representation produced by these inference methods is useful for classifying sequences of amino acids into structural classes.
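To make the idea of a fixed-length representation concrete, here is a minimal Python sketch. It does not implement the Hidden Markov Model variant described in the abstract, whose details are not given here. As a labeled stand-in, it maps each sequence to a normalized vector of first-order transition counts, which has the same length for every sequence and keeps some local order information. The function name `transition_vector` and the toy alphabet are assumptions made for illustration.

```python
from itertools import product


def transition_vector(seq, alphabet):
    """Map a sequence to a fixed-length vector of normalized bigram counts.

    Stand-in illustration only: this is NOT the HMM variant from the abstract.
    The vector always has len(alphabet) ** 2 entries, whatever len(seq) is.
    """
    pairs = list(product(alphabet, repeat=2))      # fixed ordering of bigrams
    counts = {pair: 0 for pair in pairs}
    for left, right in zip(seq, seq[1:]):          # consecutive element pairs
        counts[(left, right)] += 1
    total = sum(counts.values())
    if total == 0:                                 # sequences shorter than 2
        return [0.0] * len(pairs)
    return [counts[pair] / total for pair in pairs]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Two sequences of different lengths become comparable vectors.
short_vec = transition_vector("ABAB", "AB")
long_vec = transition_vector("ABABABABAB", "AB")
distance = euclidean(short_vec, long_vec)
```

Once every sequence is a vector of the same length, any standard vector-based classifier or distance metric can be applied, which is the property the abstract relies on.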
Combining classifiers for improved classification of proteins from sequence or structure
Cited by 7 (0 self)
Background: Predicting a protein’s structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. However, because sufficiently many positive examples are required to train such classifiers, all SVM-based methods are hampered by limited coverage. Results: In this study, we develop a hybrid machine learning approach for classifying proteins, and we apply the method to the problem of assigning proteins to structural categories based on their sequences or their 3D structures. The method combines a full-coverage but lower-accuracy nearest neighbor method with higher-accuracy but reduced-coverage multiclass SVMs to produce a full-coverage classifier with overall improved accuracy. The hybrid approach is based on the simple idea of “punting” from one method to another using a learned threshold. Conclusions: In cross-validated experiments on the SCOP hierarchy, the hybrid methods consistently outperform the individual component methods at all levels of coverage. Code and data sets are available at
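The "punting" idea can be sketched in a few lines of Python. The abstract does not say which method punts to which, or how the threshold is learned, so the sketch below makes two assumptions: a "primary" classifier returns a label with a confidence score, and the prediction falls back to a second classifier whenever that score is below a threshold chosen by accuracy on held-out data. The names `punt_predict` and `learn_threshold` are hypothetical.

```python
def punt_predict(primary, fallback_label, threshold):
    """Use the primary prediction if it is confident enough, else punt.

    primary is a (label, confidence) pair; fallback_label comes from the
    second classifier. Which method plays which role is an assumption here.
    """
    label, confidence = primary
    return label if confidence >= threshold else fallback_label


def learn_threshold(primary_preds, fallback_labels, true_labels, candidates):
    """Pick the candidate threshold with the best accuracy on held-out data."""
    best_threshold, best_accuracy = None, -1.0
    for t in candidates:
        correct = sum(
            punt_predict(p, f, t) == y
            for p, f, y in zip(primary_preds, fallback_labels, true_labels)
        )
        accuracy = correct / len(true_labels)
        if accuracy > best_accuracy:               # keep the first best value
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Toy held-out set: the primary classifier is wrong when its confidence is low.
primary_preds = [("a", 0.9), ("a", 0.2), ("a", 0.8), ("b", 0.1)]
fallback_labels = ["b", "b", "c", "c"]
true_labels = ["a", "b", "a", "c"]
threshold, accuracy = learn_threshold(
    primary_preds, fallback_labels, true_labels, candidates=[0.0, 0.5, 1.0]
)
```

In this toy example neither classifier alone is right on every item, but the thresholded combination is, which mirrors the claim that the hybrid can outperform both components.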
A Study Of Hierarchical and Flat Classification Of Proteins
- IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
, 2008
Cited by 3 (1 self)
Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multi-class classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multi-class settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this we recommend that strong flat multi-class methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area.
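For readers unfamiliar with nested dichotomies, the following Python sketch shows the general construction as it is commonly described: the class set is split recursively into two subsets until each subset holds one class, and a class probability is the product of the branch probabilities on the path to its leaf. This is an illustrative sketch, not the article's implementation. The random splitting, the `p_left` callback (which stands in for a trained binary base learner at each internal node), and the assumption that class labels are not tuples are all choices made here for brevity.

```python
import random


def build_dichotomy(classes, rng):
    """Recursively split a class list into two random non-empty subsets.

    Leaves are class labels; internal nodes are (left, right) tuples.
    Class labels themselves must not be tuples in this sketch.
    """
    classes = list(classes)
    if len(classes) == 1:
        return classes[0]
    rng.shuffle(classes)
    cut = rng.randint(1, len(classes) - 1)         # both sides non-empty
    return (build_dichotomy(classes[:cut], rng),
            build_dichotomy(classes[cut:], rng))


def class_probabilities(tree, p_left, prob=1.0, path=()):
    """Multiply branch probabilities along each root-to-leaf path.

    p_left(path) returns the probability of taking the left branch at the
    internal node reached by path; a real system would query a binary
    classifier trained for that node.
    """
    if not isinstance(tree, tuple):
        return {tree: prob}
    left, right = tree
    p = p_left(path)
    result = class_probabilities(left, p_left, prob * p, path + (0,))
    result.update(class_probabilities(right, p_left, prob * (1.0 - p), path + (1,)))
    return result


tree = build_dichotomy(["fold1", "fold2", "fold3", "fold4"], random.Random(0))
probs = class_probabilities(tree, lambda path: 0.7)
```

A hierarchical variant would restrict the splits to those consistent with the expert-defined class hierarchy, while a flat variant is free to split the classes arbitrarily, as the random splitting above does. An ensemble averages the class probabilities from several such trees.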
Target identification for . . . Ranking based Methods
, 2008
Drug discovery is an expensive process. It has been estimated that a new drug compound introduced to the market after FDA approval carries a cost of approximately $800 million, from the conception of a target implicated in a disease to the successful identification of a chemical entity or drug that succeeds in human trials. There is a need to cut the cost of developing new drugs (lowering the overall cost for producers and consumers alike) by identifying promising candidate targets and compounds, and by tackling problems such as toxicity, lack of efficacy in humans, and poor physical properties in the early stages of drug discovery. To achieve this objective, the development of computational techniques that identify all the likely targets for a given chemical compound has been an active area of research in recent years. Identification of all the potential targets for a chemical compound
Testing gene set enrichment for subset of genes: Sub-GSE
- BMC Bioinformatics
, 2008
Background: Many methods have been developed to test the enrichment of genes related to certain phenotypes or cell states in gene sets. These approaches usually combine gene expression data with functionally related gene sets as defined in databases such as Gene Ontology (GO), KEGG, or BioCarta. The results based on gene set analysis are generally more biologically interpretable, accurate and robust than the results based on individual gene analysis. However, while most available methods for gene set enrichment analysis test the enrichment of the entire gene set, it is more likely that only a subset of the genes in the gene set is related to the phenotypes of interest. Results: In this paper, we develop a novel method, termed Sub-GSE, which measures the enrichment of a predefined gene set, or pathway, by testing its subsets. The application of Sub-GSE to two simulated and two real datasets shows Sub-GSE to be more sensitive than previous methods, such as GSEA, GSA, and SigPath, in detecting gene sets associated with a phenotype of interest. This is particularly true for cases in which only a fraction of the genes in the gene set are
Content-Based Methods for Predicting Web-Site Demographic Attributes
, 2010
Demographic information plays an important role in gaining valuable insights about a web-site’s user-base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyperlinked structure, without relying on any information directly or indirectly obtained from the web-site’s users. Such methods are important because users are becoming increasingly concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches are developed and studied for predicting demographic attributes that utilize different content-derived features, different ways of building the prediction models, and different ways of aggregating web-page level predictions that take into account the web’s hyperlinked structure. In addition, a matrix-approximation based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8–10% and provide insights on how to best train and apply such models.
Multiple Kernel Learning for Fold Recognition
Fold recognition is a key problem in computational biology that involves classifying proteins sharing structural similarities into classes commonly known as “folds”. Recently, researchers have developed several efficient kernel-based discriminatory methods for fold classification using sequence information. These methods train one-versus-rest binary classifiers using well-optimized kernels from different data sources and techniques. Integrating this vast amount of data in the form of kernel matrices is an interesting and challenging problem. The positive semidefinite property of the various kernel matrices makes it attractive to cast the task of learning an optimal weighting of several kernel matrices as a semidefinite programming optimization problem. We experiment with a previously introduced quadratically constrained quadratic optimization problem for kernel integration using 1-norm and 2-norm support vector machines. We integrate state-of-the-art profile-based direct kernels to learn an optimal kernel matrix K∗. Our experimental results show a small but significant improvement in classification accuracy across the different fold classes. Our analysis illustrates the strength of two dominating kernels across different fold classes, which suggests the redundant nature of the kernel matrices being combined.
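The property that makes kernel weighting tractable is that a non-negative weighted sum of positive semidefinite kernel matrices is itself positive semidefinite. The Python sketch below illustrates only that combination step. The weights are fixed by hand, whereas the abstract learns them through a quadratically constrained quadratic program that is not reproduced here. The function names and the toy feature matrices are assumptions for illustration.

```python
def gram(X):
    """Linear-kernel Gram matrix of the rows of X (always PSD)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def combine_kernels(kernels, weights):
    """Return the weighted sum of equally sized kernel matrices.

    Weights must be non-negative so that the combined matrix stays PSD.
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


def quad_form(K, v):
    """Compute v^T K v; this is non-negative for every v when K is PSD."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


# Two hypothetical data sources describing the same three proteins.
K1 = gram([[1, 0], [0, 1], [1, 1]])
K2 = gram([[2], [1], [0]])
K_combined = combine_kernels([K1, K2], [0.25, 0.75])
```

If two of the learned weights dominate and the rest are near zero, as the abstract reports, the combined matrix is close to a mix of just those two kernels, which is one way to read the remark about redundancy.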
Fixed Length Representations of Sequence Data Using a Variant of the Hidden Markov Model
Sequence classification is central to many practical problems within machine learning. Classification algorithms often center around the notion of a distance metric between examples. The Euclidean distance between vectors often has an intuitive meaning that transfers naturally to the classification domain. Distance metrics between arbitrary pairs of sequences, however, can be harder to define because sequences can vary in length and the information contained in the order of sequence elements is lost when standard distance metrics are applied. We present a scheme that employs a Hidden Markov Model variant to produce a set of fixed-length vectors from a set of sequences. We then define three inference algorithms, a Baum-Welch variant, a Gibbs Sampling algorithm, and a variational algorithm, to infer model parameters. Finally, we show experimentally that the fixed-length representation produced by these inference methods is useful for classifying proteins by structural taxonomy.
Co-Multistage of Multiple Classifiers for
In this work, we propose two stochastic architectural models (CMC and CMC-M) with two layers of classifiers, applicable to datasets with one and with multiple skewed classes. This distinction becomes important when the datasets have a large number of classes. We therefore present a novel solution to imbalanced multiclass learning with several skewed majority classes, which improves the identification of minority classes. This is particularly important for text classification tasks such as event detection. Our models, combined with preprocessing sampling techniques, improved the classification results on six well-known datasets. Finally, we also introduce a new metric, SG-Mean, to overcome the multiplication-by-zero limitation of G-Mean.
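The multiplication-by-zero limitation can be seen from the usual definition of G-Mean as the geometric mean of per-class recalls: a single class with zero recall drives the whole score to zero, however well the other classes are classified. The Python sketch below illustrates only that limitation. The definition of SG-Mean is not given in the abstract, so it is deliberately not implemented here, and the helper names are assumptions.

```python
from collections import Counter
from math import prod


def per_class_recall(y_true, y_pred):
    """Recall for every class that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def g_mean(recalls):
    """Geometric mean of per-class recalls (standard multiclass G-Mean)."""
    values = list(recalls)
    return prod(values) ** (1.0 / len(values))


# Class "c" is a minority class that the classifier never predicts.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "a", "a"]
recalls = per_class_recall(y_true, y_pred)   # recall for "c" is 0.0
score = g_mean(recalls.values())             # collapses to 0.0
```

With many classes and few minority examples, zero recall on at least one class is common, so the plain G-Mean cannot distinguish between classifiers that differ on all the remaining classes.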