Results 1 - 10 of 83
Hidden Markov models for detecting remote protein homologies
Bioinformatics, 1998
Cited by 229 (12 self)
A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (HMM) from the sequence and homologs found using the HMM for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAM-T98 method with four datasets. Three of the test sets are fold-recognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against wu-blastp and against double-blast, a two-step method similar to ISS, but using blast instead of fasta. Results: SAM-T98 had the fewest errors in all tests, dramatically so for the fold-recognition tests. At the minimum-error point on the SCOP-domains test, SAM-T98 got 880 true positives and 68 false positives, double-blast got 533 true positives with 71 false positives, and wu-blastp got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships. One key to the performance of the HMM method is a new score-normalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability: A World Wide Web server, as well as information on obtaining the Sequence Alignment and ... (Preprint, to appear in Bioinformatics, 1999.)
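The reversed-model normalization mentioned in this abstract can be illustrated with a toy sketch. This is not SAM-T98 code: a gapless position-specific profile stands in for the HMM, and the two-letter alphabet, the `profile` values, and the function names are all assumptions made for illustration. The idea shown is only that the null score comes from the same model with its positions reversed, instead of from a uniform background.

```python
import math

def log_prob(profile, seq):
    """Log-probability of seq under a gapless profile (one column per position)."""
    return sum(math.log(column[residue]) for column, residue in zip(profile, seq))

def reverse_normalized_score(profile, seq):
    """Score under the model minus the score under the reversed model."""
    reversed_profile = profile[::-1]  # same columns, opposite order
    return log_prob(profile, seq) - log_prob(reversed_profile, seq)

# Hypothetical 3-column profile over a toy alphabet {A, B}.
profile = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.5, "B": 0.5},
    {"A": 0.1, "B": 0.9},
]

for seq in ("AAB", "ABA", "BAA"):
    print(seq, round(reverse_normalized_score(profile, seq), 3))
```

In this toy setting a sequence that matches the model's direction scores positive, its mirror image scores negative by the same amount, and a palindromic sequence scores zero.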
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
1996
Cited by 105 (20 self)
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
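As a rough illustration of how a Dirichlet mixture can be combined with observed counts, the sketch below computes the posterior mean estimate for one profile column. It assumes the standard form of that estimate (each component weighted by its posterior probability given the counts); the three-letter alphabet, the mixture weights, and the component parameters are invented for the example and are not taken from the paper.

```python
import math

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def posterior_estimate(counts, weights, components):
    """Posterior mean probabilities for one column under a Dirichlet mixture prior."""
    # log of q_j * B(alpha_j + n) / B(alpha_j), up to a constant shared by all j
    log_post = []
    for q, alpha in zip(weights, components):
        updated = [a + n for a, n in zip(alpha, counts)]
        log_post.append(math.log(q) + log_beta(updated) - log_beta(alpha))
    top = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    resp = [u / total for u in unnorm]  # P(component j | counts)

    n_total = sum(counts)
    return [
        sum(r * (counts[i] + alpha[i]) / (n_total + sum(alpha))
            for r, alpha in zip(resp, components))
        for i in range(len(counts))
    ]

# Hypothetical two-component mixture over a three-letter alphabet.
weights = [0.5, 0.5]
components = [[1.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

print(posterior_estimate([0, 0, 0], weights, components))  # no data: mixture of prior means
print(posterior_estimate([8, 1, 0], weights, components))  # data pulls the estimate toward letter 0
```

With no observations the estimate reduces to the weighted average of the component means; as counts accumulate, the observed frequencies dominate.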
PROBCONS: Probabilistic consistency-based multiple sequence alignment
Genome Res, 2005
Cited by 84 (5 self)
To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present PROBCONS, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark datasets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, PROBCONS achieves statistically significant improvement over other leading methods while maintaining practical speed. PROBCONS is publicly available as a web resource. Source code and executables are available under the GNU Public License at
ProtoMap: automatic classification of protein sequences and hierarchy of protein families
Nucleic Acids Res, 1999
Cited by 71 (13 self)
We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families. Proteins 1999;37:360–378. © 1999 Wiley-Liss, Inc. Key words: clustering; protein families; protein classification; sequence alignment; homologous proteins
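The general idea of clustering by transitivity at several significance levels can be sketched as follows. This is a deliberately simplified stand-in, not the ProtoMap algorithm: it merges any two proteins linked by a pair at or below the threshold (plain single linkage via union-find), whereas the abstract stresses that transitivity is applied restrictively. The pair list, the E-value-like scores, and the thresholds are hypothetical.

```python
def cluster(pairs, threshold):
    """Single-linkage clusters from (a, b, evalue) pairs with evalue <= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, evalue in pairs:
        root_a, root_b = find(a), find(b)  # registers both proteins even if not merged
        if evalue <= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical similarity pairs: (protein, protein, E-value-like score).
pairs = [("A", "B", 1e-50), ("B", "C", 1e-5), ("C", "D", 1e-1)]

# Relaxing the threshold step by step yields a coarse hierarchy of clusters.
for threshold in (1e-20, 1e-3, 1.0):
    print(threshold, cluster(pairs, threshold))
```

Each looser threshold can only merge clusters from the stricter level, which is what gives the nested, hierarchical organization.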
A Robust Model for Finding Optimal Evolutionary Trees
1993
Cited by 71 (12 self)
Constructing evolutionary trees for species sets is a fundamental problem in computational biology. One of the standard models assumes the ability to compute distances between every pair of species and seeks to find an edge-weighted tree $T$ in which the distance $d^T_{ij}$ in the tree between the leaves of $T$ corresponding to the species $i$ and $j$ exactly equals the observed distance, $d_{ij}$. When such a tree exists, this is expressed in the biological literature by saying that the distance function or matrix is additive, and trees can be constructed from additive distance matrices in $O(n^2)$ time. Real distance data is hardly ever additive, and we therefore need ways of modeling the problem of finding the best-fit tree as an optimization problem. In this paper we present several natural and realistic ways of modeling the inaccuracies in the distance data. In one model we assume that we have upper and lower bounds for the distances between pairs of species and try to find an additive distanc...
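Additivity can be tested with the classical four-point condition: for every four species, of the three ways to pair them up and sum the two distances, the two largest sums must be equal. The sketch below applies that check by brute force. It is not the $O(n^2)$ construction the abstract refers to (it examines every quadruple), and it assumes the input is a symmetric matrix with a zero diagonal that already satisfies the triangle inequality. The example matrix is made up, derived from a small tree with leaf edges 1, 2, 3, 4 and an internal edge of length 5.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Four-point check on a symmetric distance matrix given as nested lists."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # The two largest pairwise sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances read off a hypothetical tree ((A,B),(C,D)).
tree_like = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# Same matrix with one measurement perturbed, as in noisy real data.
noisy = [row[:] for row in tree_like]
noisy[0][2] = noisy[2][0] = 8

print(is_additive(tree_like), is_additive(noisy))
```

A single perturbed entry is enough to break additivity, which is why the best-fit formulations described in the abstract are needed for real data.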
Blocks+: A Non-Redundant Database of Protein Alignment Blocks Derived from Multiple Compilations
1999
Cited by 67 (2 self)
Motivation: As databanks grow, sequence classification and prediction of function by searching protein family databases becomes increasingly valuable. The original Blocks Database, which contains ungapped multiple alignments for families documented in PROSITE, can be searched to classify new sequences. However, PROSITE is incomplete, and families from other databases are now available to expand coverage of the Blocks Database.
Identification of protein coding regions by database similarity search
Nature Genetics, 1993
Cited by 64 (1 self)
Correspondence should be addressed to W.G. Summary: Sequence similarity between a translated nucleotide sequence and a known biological protein can provide strong evidence for the presence of a homologous coding region, and such similarities can often be identified even between distantly related genes. The computer program BLASTX performed conceptual translation of a nucleotide query sequence followed by a protein database search in one programmatic step. The BLAST search algorithm combined with Karlin-Altschul statistics yields a predictable selectivity that has been parameterized. We characterized the sensitivity of BLASTX recognition to the presence of substitution, insertion and deletion errors in the query sequence and to sequence divergence. Reading frames were reliably identified in the presence of 1% query errors, a rate that is typical for primary nucleotide sequence data. BLASTX is appropriate for use in moderate and large scale sequencing projects at the earliest opportunity, when the data are most prone to containing errors.
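The conceptual translation step can be pictured as translating the query in all six reading frames before any protein comparison takes place. The sketch below does only that and is not BLASTX itself: it assumes the standard genetic code, maps any codon containing a non-ACGT character to `X`, and uses function names chosen for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Return translations keyed by frame: 1..3 forward, -1..-3 reverse strand."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames

for frame, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(frame, protein)
```

A single inserted or deleted base shifts the correct translation into another frame, which is one reason all six frames are searched when the query may contain sequencing errors.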
BLAST: at the core of a powerful and diverse set of sequence analysis tools
Nucleic Acids Res, 2004
Cited by 51 (3 self)
Basic Local Alignment Search Tool (BLAST) is one of the most heavily used sequence analysis tools available in the public domain. There is now a wide choice of BLAST algorithms that can be used to search many different sequence databases via the BLAST web pages
Predicting protein structure using hidden Markov models
1997
Cited by 46 (18 self)
We discuss how methods based on hidden Markov models performed in the fold recognition section of the CASP2 experiment. Hidden Markov models were built for a set of about a thousand structures from the PDB database, and each CASP2 target sequence was scored against this library of hidden Markov models. In addition, a hidden Markov model was built for each of the target sequences, and all of the sequences in PDB were scored against that target model. Having high scores from both methods was found to be highly indicative of the target and a structure being homologous. Predictions were made based on several criteria: the scores with the structure models, the scores with the target models, consistency between the secondary structure in the known structure and predictions for the target (using the program PhD), human examination of predicted alignments between target and structure (using RASMOL), and solvation preferences in the alignment of the target and structure. The method worked well in comparison to other methods used at CASP2 for targets of moderate difficulty, where the closest structure in PDB could be aligned to the target with at least 15% residue identity. There was no evidence for the method's effectiveness for harder cases, where the residue identity was much lower than 15%.
Variations on Probabilistic Suffix Trees: Statistical Modeling and Prediction of Protein Families
2001
Cited by 46 (4 self)
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance. Results: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster. Availability: The programs are available upon request from the authors. Contact: jill@cs.huji.ac.il; golan@cs.cornell.edu

