Results 1-10 of 12
Hidden Markov models for detecting remote protein homologies
 Bioinformatics
, 1998
Cited by 442 (15 self)
A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (HMM) from the sequence and homologs found using the HMM for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAM-T98 method with four datasets. Three of the test sets are fold-recognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against wu-blastp and against double-blast, a two-step method similar to ISS, but using blast instead of fasta. Results: SAM-T98 had the fewest errors in all tests, dramatically so for the fold-recognition tests. At the minimum-error point on the SCOP-domains test, SAM-T98 got 880 true positives and 68 false positives, double-blast got 533 true positives with 71 false positives, and wu-blastp got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships. One key to the performance of the HMM method is a new score-normalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability: A World Wide Web server, as well as information on obtaining the Sequence Alignment and ... (Preprint, to appear in Bioinformatics, 1999.)
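The reversed-model normalization mentioned in the abstract above can be illustrated with a toy sketch. It assumes a drastically simplified model: a list of per-position emission distributions with no insert or delete states and no transition probabilities. The function names and the toy data are hypothetical and are not taken from the SAM software.

```python
import math

def log_prob(sequence, columns):
    """Log-probability of a sequence under independent per-position emissions."""
    return sum(math.log(col[ch]) for ch, col in zip(sequence, columns))

def reverse_normalized_score(sequence, columns):
    """Score against the model minus the score against the same model reversed."""
    reversed_columns = columns[::-1]  # same positions, opposite order
    return log_prob(sequence, columns) - log_prob(sequence, reversed_columns)

# Toy two-position model over a two-letter alphabet (illustrative values only).
toy_columns = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]
print(reverse_normalized_score("AB", toy_columns))
print(reverse_normalized_score("AA", toy_columns))
```

In this toy setting the reversed model has the same length and composition as the forward model, so a sequence that merely matches the model's overall composition (such as "AA" here) gains nothing, while a sequence matching the positional order scores above zero.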
A Discriminative Framework for Detecting Remote Protein Homologies
, 1999
Cited by 237 (4 self)
A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a protein family, in this case a hidden Markov model. This general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
Predicting protein structure using hidden Markov models
, 1997
Cited by 69 (21 self)
We discuss how methods based on hidden Markov models performed in the fold recognition section of the CASP2 experiment. Hidden Markov models were built for a set of about a thousand structures from the PDB database, and each CASP2 target sequence was scored against this library of hidden Markov models. In addition, a hidden Markov model was built for each of the target sequences, and all of the sequences in PDB were scored against that target model. Having high scores from both methods was found to be highly indicative of the target and a structure being homologous. Predictions were made based on several criteria: the scores with the structure models, the scores with the target models, consistency between the secondary structure in the known structure and predictions for the target (using the program PhD), human examination of predicted alignments between target and structure (using RASMOL), and solvation preferences in the alignment of the target and structure. The method worked well in comparison to other methods used at CASP2 for targets of moderate difficulty, where the closest structure in PDB could be aligned to the target with at least 15% residue identity. There was no evidence for the method's effectiveness for harder cases, where the residue identity was much lower than 15%.
Scoring Hidden Markov Models
Cited by 51 (6 self)
Motivation: Statistical sequence comparison techniques, such as hidden Markov models and generalized profiles, calculate the probability that a sequence was generated by a given model. Log-odds scoring is a means of evaluating this probability by comparing it to a null hypothesis, usually a simpler statistical model intended to represent the universe of sequences as a whole, rather than the group of interest. Such scoring leads to two immediate questions: what should the null model be, and what threshold of log-odds score should be deemed a match to the model? Results: This paper experimentally analyses these two issues. Within the context of the Sequence Alignment and Modeling software suite (SAM), we consider a variety of null models and suitable thresholds. Additionally, we consider HMMer's log-odds scoring and SAM's original Z-scoring method. Among the null model choices, a simple looping null model that emits characters according to the geometric mean of the character probabilities in the columns modeled by the HMM performs well or best across all four discrimination experiments.
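As a rough illustration of log-odds scoring against a geometric-mean null model, the sketch below uses a simplified setting that is an assumption rather than SAM's or HMMer's actual computation: the model is only a list of per-column emission distributions with strictly positive probabilities, transitions and insert states are ignored, and the function names are hypothetical.

```python
import math

def geometric_mean_null(columns):
    """Null distribution: per-symbol geometric mean over columns, renormalized."""
    symbols = list(columns[0])
    raw = {
        s: math.exp(sum(math.log(col[s]) for col in columns) / len(columns))
        for s in symbols
    }
    total = sum(raw.values())
    return {s: value / total for s, value in raw.items()}

def log_odds(sequence, columns, null):
    """Log-odds score in bits: model probability versus the null model."""
    return sum(math.log2(col[ch] / null[ch]) for ch, col in zip(sequence, columns))

# Toy two-column model over a two-letter alphabet (illustrative values only).
columns = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]
null = geometric_mean_null(columns)
print(null)
print(log_odds("AB", columns, null), log_odds("BA", columns, null))
```

A positive score means the sequence is more probable under the model than under the null; choosing the threshold above which a score counts as a match is the second question the abstract raises.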
Family-based Homology Detection via Pairwise Sequence Comparison
, 1998
Cited by 19 (2 self)
The function of an unknown biological sequence can often be accurately inferred by identifying sequences homologous to the original sequence. Given a query set of known homologs, there exist at least three general classes of techniques for finding additional homologs: pairwise sequence comparisons, motif analysis, and hidden Markov modeling. Pairwise sequence comparisons are typically employed when only a single query sequence is known. Hidden Markov models (HMMs), on the other hand, are usually trained with sets of more than 100 sequences. Motif-based methods fall in between these two extremes. The current work compares the performance of representative examples of these three homology detection techniques, using the BLAST, MEME and HMMER software, across a wide range of protein families, using query sets of varying sizes. A linear combination of multiple pairwise sequence comparisons outperforms motif-based and HMM methods for all query set sizes. Furthermore, heuristic pairwise com...
Homology Detection via Family Pairwise Search
 Journal of Computational Biology
, 1998
Cited by 15 (2 self)
The function of an unknown biological sequence can often be accurately inferred by identifying sequences homologous to the original sequence. Given a query set of known homologs, there exist at least three general classes of techniques for finding additional homologs: pairwise sequence comparisons, motif analysis, and hidden Markov modeling. Pairwise sequence comparisons are typically employed when only a single query sequence is known. Hidden Markov models (HMMs), on the other hand, are usually trained with sets of more than 100 sequences. Motif-based methods fall in between these two extremes. The current work introduces a straightforward generalization of pairwise sequence comparison algorithms to the case when multiple query sequences are available. This algorithm, called Family Pairwise Search (FPS), combines pairwise sequence comparison scores from each query sequence. A BLAST implementation of FPS is compared to representative examples of hidden Markov modeling...
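The idea of combining per-query pairwise scores can be sketched as follows. This is a minimal illustration under stated assumptions: the pairwise scorer is a toy position-match count standing in for a real tool such as BLAST, the combining rule defaults to the mean (the abstract does not say which rule the paper uses), and all names are hypothetical.

```python
from statistics import mean

def toy_pairwise_score(a, b):
    """Toy stand-in for a pairwise comparison score: count of matching positions."""
    return sum(x == y for x, y in zip(a, b))

def family_pairwise_search(queries, database, score=toy_pairwise_score, combine=mean):
    """Rank database sequences by a combined score over all query sequences."""
    scored = [(combine([score(q, target) for q in queries]), target) for target in database]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical query family and database.
queries = ["ACDE", "ACDF"]
database = ["ACDE", "WWWW", "ACGG"]
for combined, target in family_pairwise_search(queries, database):
    print(target, combined)
```

Passing `combine=max` instead of the default gives a different combining rule without changing the rest of the search.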
Evaluating regularizers for estimating distributions of amino acids
 In
, 1995
Cited by 9 (0 self)
This paper makes a quantitative comparison of different methods, called regularizers, for estimating the distribution of amino acids in a specific context, given a very small sample of amino acids from that distribution. The regularizers considered here are zero-offsets, pseudocounts, substitution matrices (with several variants), and Dirichlet mixture regularizers. Each regularizer is evaluated based on how well it estimates the distributions of the columns of a multiple alignment, specifically the expected encoding cost per amino acid using the regularizer and all possible samples from each column. In general, pseudocounts give the lowest encoding costs for samples of size zero, substitution matrices give the lowest encoding costs for samples of size one, and Dirichlet mixtures give the lowest for larger samples. One of the substitution matrix variants, which added pseudocounts and scaled counts, does almost as well as the best known Dirichlet mixtures, but with a lower computation cost.
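A pseudocount regularizer and an encoding-cost measure can be sketched as below. This is a simplified illustration, not the paper's evaluation protocol: it scores a single estimate against a known distribution instead of averaging over all possible samples from each column, it uses a two-letter alphabet for brevity, and the function names are hypothetical.

```python
import math

def pseudocount_estimate(counts, pseudocounts):
    """Estimate a distribution as (observed count + pseudocount) / total."""
    total = sum(counts.get(a, 0) + z for a, z in pseudocounts.items())
    return {a: (counts.get(a, 0) + z) / total for a, z in pseudocounts.items()}

def encoding_cost(true_dist, estimate):
    """Expected bits per symbol when coding draws from true_dist using estimate."""
    return -sum(p * math.log2(estimate[a]) for a, p in true_dist.items() if p > 0)

# Hypothetical sample of size one and uniform pseudocounts.
estimate = pseudocount_estimate({"A": 1}, {"A": 1.0, "B": 1.0})
print(estimate)
print(encoding_cost({"A": 0.8, "B": 0.2}, estimate))
```

With an empty sample the estimate reduces to the normalized pseudocounts, which is the sample-size-zero case the abstract singles out.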
Family Pairwise Search with Embedded Motif Models
, 1999
Cited by 5 (0 self)
Motivation: Statistical models of protein families, such as position-specific scoring matrices, profiles and hidden Markov models, have been used effectively to find remote homologs when given a set of known protein family members. Unfortunately, training these models typically requires a relatively large set of training sequences. Recent work [Grundy, 1998] has shown that, when only a few family members are known, several theoretically justified statistical modeling techniques fail to provide homology detection performance on a par with Family Pairwise Search (FPS), an algorithm that combines scores from a pairwise sequence similarity algorithm such as BLAST. Results: This paper provides a model-based algorithm that improves FPS by incorporating hybrid motif-based models of the form generated by Cobbler [Henikoff and Henikoff, 1997]. For the 73 protein families investigated here, this cobbled FPS algorithm provides better homology detection performance than either Cobbler or FPS alo...
Classification of glycoside hydrolase sequences using hidden Markov models
Cited by 3 (0 self)
Enzymes that hydrolyse oligo- and polysaccharides, glycoside hydrolases, play a central role in a wide array of biological processes. A manual (sequence-based) classification of glycoside hydrolases using hydrophobic cluster analysis, HCA, was introduced in 1991 (B. Henrissat). By using hidden Markov models we have been able to automate the classification of unknown proteins into glycoside hydrolase sequence families. At a significance level of 0.01, Dirichlet regularized models are able to successfully discriminate 99.4% of the glycoside hydrolase sequences with 100% precision.
unknown title
With the exploding size of genome databases, it is becoming increasingly important to devise search procedures that extract relevant information from them. One such procedure is particularly effective in finding new, distant members of a given family of related sequences: start with a multiple alignment of the given members of the family and use an integral or fractional consensus sequence derived from the alignment to further probe the database. However, the multiple alignment constructed to begin with may be biased due to skew in the sample of sequences used to construct it. We suggest strategies to overcome the problem of bias in building consensus sequences. When the intention is to build a fractional consensus sequence (often termed a profile), we propose assigning weights to the sequences such that the resulting fractional sequence has roughly the same similarity score against each of the sequences in the family. We call such fractional consensus sequences balanced profiles. On the other hand, when only regular sequences can be used in the search, we propose that the consensus sequence have minimum maximum distance from any sequence in the family to avoid bias. Such sequences are NP-hard to compute exactly, so we present an approximation algorithm with a very good performance ratio based on randomized rounding of an integer programming formulation of the problem. We also mention applications of the rounding method to selection of probes for disease detection and to construction of consensus maps.
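To make the minimum-maximum-distance objective concrete, the sketch below finds such a consensus by exhaustive search. This is deliberately not the randomized-rounding approximation the abstract describes: it is an exponential-time brute force that is only usable on tiny inputs, and it assumes equal-length sequences compared by Hamming distance, which the abstract does not specify. All names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def min_max_consensus(seqs, alphabet):
    """Exhaustively find a string minimizing the maximum distance to any input."""
    length = len(seqs[0])
    candidates = ("".join(chars) for chars in product(alphabet, repeat=length))
    best = min(candidates, key=lambda cand: max(hamming(cand, s) for s in seqs))
    return best, max(hamming(best, s) for s in seqs)

# Hypothetical family of three short sequences over a two-letter alphabet.
consensus, radius = min_max_consensus(["AAA", "AAB", "ABA"], "AB")
print(consensus, radius)
```

The search space grows as the alphabet size raised to the sequence length, which is the practical reason an approximation algorithm is needed for real sequences.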