Using the Fisher kernel method to detect remote protein homologies
In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999
Cited by 125 (3 self)
A new method, called the Fisher kernel method, for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a hidden Markov model. The general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
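For orientation, the general Fisher kernel construction can be written as follows. The notation is an assumption introduced here, not taken from the abstract: \theta stands for the parameters of the trained HMM, P(x \mid \theta) for the likelihood it assigns to sequence x, and I for the Fisher information matrix. The abstract does not say which exact variant of the kernel the paper uses, so this is only the textbook form.

```latex
U_x = \nabla_\theta \log P(x \mid \theta), \qquad
K(x, y) = U_x^{\top} I^{-1} U_y, \qquad
I = \mathbb{E}_{x \sim P(\cdot \mid \theta)}\left[ U_x U_x^{\top} \right]
```

Each sequence is mapped to a fixed-length gradient vector U_x (the Fisher score), which is what lets a generative model feed a discriminative classifier such as an SVM.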
Combining Pairwise Sequence Similarity and Support Vector Machines for Remote Protein Homology Detection
J. Comput. Biol., 2002
Cited by 116 (12 self)
One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose functions are already known. Currently, one of the most powerful such homology detection methods is the SVM-Fisher method of Jaakkola, Diekhans and Haussler (ISMB 2000). This method combines a generative, profile hidden Markov model (HMM) with a discriminative classification algorithm known as a support vector machine (SVM). The current work presents an alternative method for SVM-based protein classification. The method, SVM-pairwise, uses a pairwise sequence similarity algorithm such as Smith-Waterman in place of the HMM in the SVM-Fisher method. The resulting algorithm, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better remote protein homology detection than SVM-Fisher, profile HMMs and PSI-BLAST.
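The feature-vector idea described above can be sketched roughly as follows: each sequence is represented by its Smith-Waterman scores against every training sequence, and those vectors would then be passed to an SVM. This is a minimal illustration, not the paper's implementation. The scoring parameters (match, mismatch, linear gap), the use of raw scores, the example sequences, and the function names are all assumptions made here, and the SVM training step is left out.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (assumed toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def pairwise_features(seq, training_seqs):
    """Represent seq as its vector of similarity scores to each training sequence."""
    return [smith_waterman(seq, t) for t in training_seqs]


# Hypothetical toy sequences; real input would be protein sequences from SCOP.
training = ["HEAGAWGHEE", "PAWHEAE", "MKVLAAGIV"]
feature_matrix = [pairwise_features(s, training) for s in training]
```

Every sequence gets a vector of the same length (one entry per training sequence), which is the fixed-length input an SVM requires.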
Classifying G-protein coupled receptors with support vector machines
Bioinformatics, 2001
Cited by 46 (0 self)
Motivation: The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-protein coupled receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are found in a wide range of organisms and are central to a cellular signalling network that regulates many basic physiological processes. They are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. However, their tertiary structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile hidden Markov model, and methods, including support vector machines, that transform protein sequences into fixed-length feature vectors. Results: The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the minimum error point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% ...
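The second figure of merit quoted above, the share of true positives ranked ahead of the first false positive, can be illustrated with a short sketch. The function name and the input convention (a list of booleans ordered from best to worst score) are assumptions made for illustration, and the paper may compute the measure differently in detail.

```python
def tp_before_first_fp(ranked_labels):
    """Fraction of all positives that appear before the first negative.

    ranked_labels: booleans ordered by decreasing classifier score,
    True for a true family member, False for a non-member.
    """
    total_positives = sum(1 for label in ranked_labels if label)
    if total_positives == 0:
        return 0.0
    found = 0
    for label in ranked_labels:
        if not label:
            break  # stop at the first false positive
        found += 1
    return found / total_positives


# Hypothetical ranking: two hits, then a false positive, then one more hit.
example = tp_before_first_fp([True, True, False, True])
```

A value of 1.0 means every family member outranks every non-member, which makes the measure strict about the very top of the ranked list.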
Weighting Hidden Markov Models For Maximum Discrimination
Bioinformatics, 1998
Cited by 17 (3 self)
Motivation: Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea. Results: One good method scales sequence weights by an exponential that ranges between 0.1 for the best scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction). Availability: A World-Wide Web server, as well as information on obtaining the Sequence Alignme...
BAG: a graph theoretic sequence clustering algorithm
Int. J. Data Mining and Bioinformatics, 2003
Cited by 12 (8 self)
Recently developed sequence clustering algorithms based on graph theory have been successful in clustering a large number of sequences into families of sequences of specific categories. In this paper, we present a new sequence clustering algorithm, BAG, based on graph theory. Our algorithm clusters sequences using two graph properties: biconnected components and articulation points. As the computation of biconnected components and articulation points is efficient, linear in the number of vertices and edges, our algorithms are well suited for comparing a large number of proteins from multiple genomes. Our experiments with protein sequences from multiple genomes show that our algorithms generate families of high quality. For example, our algorithm correctly classified 3,306 predicted proteins from E. coli and H. influenzae into 1,427 families without human intervention. We also discuss the importance of large-scale sequence comparisons, drawing on our experience in clustering many different genomes, including Arabidopsis thaliana.
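The graph property at the heart of this approach can be shown with a small sketch of the standard depth-first-search method for finding articulation points, which runs in time linear in the number of vertices and edges. This is not the BAG algorithm itself: the abstract does not describe how BAG uses these points to form families, and the graph, the vertex names, and the function name below are hypothetical. The sketch assumes a simple undirected graph given as an adjacency dictionary.

```python
def articulation_points(graph):
    """Return the vertices whose removal disconnects their component.

    graph: dict mapping each vertex to a list of neighbours
    (undirected, no parallel edges).
    """
    disc, low = {}, {}  # discovery order and lowest reachable discovery order
    points = set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier vertex.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u is a cut vertex if v's subtree cannot climb above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points


# Hypothetical similarity graph: two triangles that share vertex "c".
similarity_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d", "e"],
    "d": ["c", "e"],
    "e": ["c", "d"],
}
cut_vertices = articulation_points(similarity_graph)
```

In a sequence-similarity graph, a vertex like "c" is a sequence that links two otherwise separate, tightly connected groups. The recursive form is kept for readability; very large graphs would need an iterative DFS to stay within Python's recursion limit.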
Classifying Proteins By Family Using the Product of Correlated
1999
Cited by 10 (2 self)
An important goal in bioinformatics is determining the homology and function of proteins from their sequences. Pairwise sequence similarity algorithms are often employed for this purpose. This paper describes a method for improving the accuracy of such algorithms using knowledge about families of proteins. The method requires a library of protein families against which to compare query sequences. A standard pairwise similarity search algorithm is used to search the library with the query, and a new variant of the Family Pairwise Search (FPS) algorithm converts the results into a list sorted by the E-values of the matches between the query and the families. The E-value of each query-family match is calculated using a statistical distribution introduced here that describes the behavior of the product of the p-values of correlated random variables. We also describe an algorithm (ESIZE) for estimating the single parameter of this distribution. This parameter summarizes the amount of correl...
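The abstract does not give the distribution it introduces for correlated p-values, so the sketch below covers only the simpler independent case as a point of comparison. For n independent p-values, each uniform on (0, 1) under the null hypothesis, the probability that their product is at most p is p times the sum of (-ln p)^k / k! for k from 0 to n - 1. The function name is an assumption, and the code does not reproduce the paper's correlated statistic or its ESIZE estimator.

```python
import math


def product_pvalue(pvalues):
    """P-value of the product of independent, uniformly distributed p-values."""
    prod = math.prod(pvalues)
    if prod <= 0.0:
        return 0.0
    n = len(pvalues)
    neg_log = -math.log(prod)
    term, total = 1.0, 1.0  # k = 0 term of the series
    for k in range(1, n):
        term *= neg_log / k  # builds (-ln p)^k / k! incrementally
        total += term
    return prod * total


# Hypothetical query-versus-family matches, treated as independent.
combined = product_pvalue([0.5, 0.5])
```

Treating correlated matches as independent makes this combined value too small, which is presumably the issue the paper's correlated distribution is meant to address.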
Sequence Database Search Using Jumping Alignments
2000
Cited by 8 (5 self)
We describe a new algorithm for amino acid sequence classification and the detection of remote homologues. The algorithm is based on the dynamic programming principle and evaluates the fit of a candidate sequence to a given family of sequences by means of a new score called the "jumping alignment score". In a jumping alignment, a candidate sequence is locally aligned to one reference sequence in the family, and in addition the reference sequence may change within the alignment. We show that the algorithm performs well in recovering subfamilies of the SCOP database.
Cluster utility: A new metric to guide sequence clustering
2004
Cited by 1 (1 self)
Automatic sequence clustering has become increasingly important in analyzing the ever increasing number of biological sequences. Although there has been significant progress recently in developing high performance sequence clustering algorithms, correctly clustering a large number of sequences still remains a huge challenge. More often than not, the clusters generated end up being incorrect or fragmented. We have developed a new metric called the cluster utility to guide cluster splitting. We have illustrated the effectiveness of this technique by implementing it in the BAG clustering algorithm. Experiments with the entire COG database show that the proposed technique can effectively guide correct sequence clustering even while keeping the number of fragmented clusters significantly low.
QOMA: quasi-optimal multiple alignment of protein sequences
Bioinformatics
doi:10.1093/bioinformatics/btl590
A Bayesian Approach to Motif-based Protein Modeling
1998
Table of contents (excerpt):
I Introduction
  A. Biology background
    1. Genes and proteins
    2. Protein structure and function
    3. Three tasks
    4. Motifs
  B. Hidden Markov models
    1. Definition
    2. The standard HMM topology
    3. Using HMMs for multiple alignment
    4. Using HMMs for homology detection
    5. Drawbacks of the standard topology
    ...

