Hidden Markov models for detecting remote protein homologies
- Bioinformatics, 1998
Cited by 229 (12 self)
A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (HMM) from the sequence and homologs found using the HMM for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAM-T98 method with four datasets. Three of the test sets are fold-recognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against WU-BLASTP and against double-BLAST, a two-step method similar to ISS, but using BLAST instead of FASTA. Results: SAM-T98 had the fewest errors in all tests, dramatically so for the fold-recognition tests. At the minimum-error point on the SCOP-domains test, SAM-T98 got 880 true positives and 68 false positives, double-BLAST got 533 true positives with 71 false positives, and WU-BLASTP got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships. One key to the performance of the HMM method is a new score-normalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability: A World Wide Web server, as well as information on obtaining the Sequence Alignment and ... (Preprint to appear in Bioinformatics, 1999.)
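The reversed-model normalization mentioned in this abstract can be illustrated with a deliberately simplified sketch. It assumes a gapless position-specific profile as a stand-in for a full HMM; the names `log_prob`, `reverse_normalized_score`, the `floor` value, and the toy profile are hypothetical and are not taken from SAM-T98.

```python
import math

# Toy stand-in for an HMM (an assumption of this sketch): a gapless profile
# with one dict of residue probabilities per position.
def log_prob(seq, profile, floor=1e-3):
    """Sum of per-position log-probabilities of seq under the profile.

    Residues missing from a column get a small floor value, so the columns
    are not strictly normalized distributions.
    """
    if len(seq) != len(profile):
        raise ValueError("sequence and profile lengths must match")
    return sum(math.log(col.get(res, floor)) for res, col in zip(seq, profile))

def reverse_normalized_score(seq, profile):
    """Log-odds of the model against the same model with positions reversed."""
    return log_prob(seq, profile) - log_prob(seq, profile[::-1])

profile = [
    {"A": 0.8, "C": 0.1, "G": 0.1},
    {"C": 0.8, "A": 0.1, "G": 0.1},
    {"G": 0.8, "A": 0.1, "C": 0.1},
]

match_score = reverse_normalized_score("ACG", profile)       # fits the forward model
mirror_score = reverse_normalized_score("GCA", profile)      # fits the reversed model
symmetric_score = reverse_normalized_score("CCC", profile)   # fits both equally
```

Because the reversed model has the same length and composition as the forward model, a sequence that scores well only through composition tends to score about equally under both, and the difference stays near zero.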
ProtoMap: automatic classification of protein sequences and hierarchy of protein families
- Nucleic Acids Res, 1999
Cited by 71 (13 self)
We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families. Proteins 1999;37:360–378. © 1999 Wiley-Liss, Inc. Key words: clustering; protein families; protein classification; sequence alignment; homologous proteins
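The idea of repeating a transitivity-based merge at several significance levels can be sketched as follows. This is not the two-phase algorithm of the abstract: it uses plain single-linkage merging (unrestricted transitivity) via union-find, whereas the abstract says transitivity is applied restrictively, by rules not given here. The function names, the e-value representation, and the sample pairs are assumptions made for illustration.

```python
def cluster_at_threshold(n_items, scored_pairs, max_evalue):
    """Single-linkage clusters using only pairs with e-value <= max_evalue."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in scored_pairs:
        if evalue <= max_evalue:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def hierarchy(n_items, scored_pairs, thresholds):
    """Clusterings from the strictest to the most permissive threshold."""
    return [(t, cluster_at_threshold(n_items, scored_pairs, t))
            for t in sorted(thresholds)]

# Hypothetical pairwise similarities: (protein index, protein index, e-value).
pairs = [(0, 1, 1e-50), (1, 2, 1e-5), (3, 4, 1e-30)]
levels = hierarchy(5, pairs, [1e-3, 1e-20])
```

Each looser threshold can only merge clusters found at a stricter one, which is what makes the sequence of clusterings a hierarchy.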
Empirical statistical estimates for sequence similarity searches
- J. Mol. Biol, 1998
Cited by 66 (3 self)
Sequence similarity searches today are the most effective method for exploiting the information in the rapidly growing DNA and protein sequence databases. One of the most dramatic improvements
Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures
- J Mol Biol, 2005
Cited by 53 (0 self)
The problem of aligning, or establishing a correspondence between, residues of two protein ... (Abbreviations used: ROC, receiver operating ...)
Cluster-Preserving Embedding of Proteins
- 1999
Cited by 46 (0 self)
Similarity searching in protein sequence databases is a standard technique for biologists dealing with a newly sequenced protein. Exhaustive search in such databases is prohibitive because of the large size of these databases and because pairwise comparisons are slow. Heuristic techniques, such as FASTA and BLAST, are useful because they are fast and accurate, though it has been shown that exhaustive search is more accurate. Therefore, there are times when one would like to perform an exhaustive search.
Variations on Probabilistic Suffix Trees: Statistical Modeling and Prediction of Protein Families
- 2001
Cited by 46 (4 self)
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance. Results: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster. Availability: The programs are available upon request from the authors. Contact: jill@cs.huji.ac.il; golan@cs.cornell.edu
Indexing and Retrieval for Genomic Databases
- IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 40 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
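Selecting candidate sequences from an index before any alignment is attempted can be sketched as below. This is a generic k-mer inverted index, assumed here for illustration; the abstract does not specify the index structure, the interval length, or the ranking rule, so `build_index`, `candidates`, `k=3`, and the sample sequences are all hypothetical.

```python
from collections import Counter, defaultdict

def build_index(sequences, k=3):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def candidates(query, index, k=3, top=2):
    """Rank sequences by the number of distinct query k-mers they share."""
    hits = Counter()
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return [seq_id for seq_id, _ in hits.most_common(top)]

# Hypothetical toy database.
database = {"s1": "ACGTACGT", "s2": "TTTTGGGG", "s3": "ACGTTTTT"}
index = build_index(database)
selected = candidates("ACGTAC", index)
```

Only the sequences returned by `candidates` would then be passed to the expensive local-alignment step, which is where the savings described in the abstract would come from.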
Pcons: A neural-network-based consensus predictor that improves fold recognition
- Protein Sci, 2001
"... improves fold recognition ..."
Global Self Organization of All Known Protein Sequences Reveals Inherent Biological Signatures
- 1997
Cited by 35 (3 self)
A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acids and a dynamic-programming distance is calculated between each pair of segments. This space of segments is first embedded into Euclidean space with small metric distortion. A novel self-organized cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most updated classifications based on functional and structural protein data. Motifs and domains such as the Zinc Finger, EF hand, Homeobox, EGF-like and others are automatically correctly identified. A novel representation of protein families is introduced, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporters family. The self organization method prese...
Assessing The Performance Of Fold Recognition Methods By Means Of A Comprehensive Benchmark.
- Pac. Symp. Biocomput, 1996
Cited by 33 (0 self)
this paper addresses. Our goal is to devise a benchmark that can aid in assessing the performance of a fold-recognition method in an objective, unbiased and thorough way. The benchmark is independent of the representation of the proteins, the compatibility definition, the search algorithm, and the ranking and significance estimation procedures used in the method being evaluated. Thus, it allows a systematic comparison of different methods. Benchmarks are routinely used to assess performance of sequence-sequence alignment (e.g.

