ProtoMap: automatic classification of protein sequences and hierarchy of protein families
- Nucleic Acids Res
, 1999
Cited by 71 (13 self)
We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families. Proteins 1999;37:360–378. © 1999 Wiley-Liss, Inc. Key words: clustering; protein families; protein classification; sequence alignment; homologous proteins
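The idea of merging clusters at successively weaker significance levels can be illustrated with a minimal sketch. It is not the ProtoMap algorithm: it applies plain single-linkage merging (unrestricted transitivity) through a union-find structure, whereas the abstract states that transitivity is applied restrictively. The function names, protein labels, and E-values are all hypothetical.

```python
from collections import defaultdict

def find(parent, x):
    """Return the representative of x's cluster, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_at_thresholds(proteins, pair_evalues, thresholds):
    """Single-linkage clusters at each E-value threshold, strictest first.

    pair_evalues maps (protein_a, protein_b) to a similarity E-value;
    a smaller E-value means a more significant pair.
    """
    parent = {p: p for p in proteins}
    pairs = sorted(pair_evalues.items(), key=lambda kv: kv[1])
    levels = {}
    i = 0
    for t in sorted(thresholds):
        # Merge every pair that is significant at the current threshold.
        while i < len(pairs) and pairs[i][1] <= t:
            (a, b), _ = pairs[i]
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[rb] = ra
            i += 1
        groups = defaultdict(list)
        for p in proteins:
            groups[find(parent, p)].append(p)
        levels[t] = sorted(sorted(g) for g in groups.values())
    return levels

# Hypothetical pairwise E-values for six toy "proteins".
evalues = {("A", "B"): 1e-100, ("B", "C"): 1e-20,
           ("C", "D"): 1e-3, ("E", "F"): 1e-50}
levels = cluster_at_thresholds(list("ABCDEF"), evalues, [1e-80, 1e-10, 1e-1])
```

Because clusters only grow as the threshold is relaxed, the per-threshold partitions nest inside one another, which is what yields a hierarchy.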
Empirical statistical estimates for sequence similarity searches
- J. Mol. Biol
, 1998
Cited by 66 (3 self)
Sequence similarity searches today are the most effective method for exploiting the information in the rapidly growing DNA and protein sequence databases. One of the most dramatic improvements
How representative are the known structures of the proteins in a complete genome? A comprehensive structural census
- Fold Des 3
, 1998
Cited by 44 (24 self)
Indexing and Retrieval for Genomic Databases
- IEEE Transactions on Knowledge and Data Engineering
, 2002
Cited by 40 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
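The general idea of using an index to select candidate sequences before running expensive local alignments can be sketched as follows. This is an illustrative in-memory inverted index over fixed-length substrings (k-mers), not the index structure from the paper; the function names, the value of k, and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict

def build_index(sequences, k=3):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def rank_candidates(index, query, k=3, top=2):
    """Rank sequences by the number of distinct query k-mers they share."""
    hits = Counter()
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return [seq_id for seq_id, _ in hits.most_common(top)]

# Hypothetical toy database; only the returned ids would be aligned.
db = {"s1": "ACGTACGT", "s2": "TTTTGGGG", "s3": "ACGTTTTT"}
candidates = rank_candidates(build_index(db), "ACGTAC")
```

Only the sequences returned by `rank_candidates` would then be passed to a local alignment routine, which is where the savings over an exhaustive scan would come from.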
Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census
- Proteins
, 1998
Cited by 38 (27 self)
Eight microbial genomes are compared in terms of protein structure. Specifically, yeast, H. influenzae, M. genitalium, M. jannaschii, Synechocystis, M. pneumoniae, H. pylori, and E. coli are compared in terms of patterns of fold usage, that is, whether a given fold occurs in a particular organism. Of the ~340 soluble protein folds currently in the structure databank (PDB), 240 occur in at least one of the eight genomes, and 30 are shared amongst all eight. The shared folds are depleted in all-helical structure and enriched in mixed helix-sheet structure compared to the folds in the PDB. The top-10 most common of the shared 30 are enriched in superfolds, uniting many nonhomologous sequence families, and are especially similar in overall architecture, with eight having helices packed onto a central sheet. They are also very different from the common folds in the PDB, highlighting databank biases. Folds can be ranked in terms of expression as well as genome duplication. In yeast the top-10 most highly ex...
Intrinsically disordered protein
- J. Mol. Graph. Model
, 2001
Cited by 30 (11 self)
Dunker K., et al. Proteins can exist in a trinity of structures: the ordered state, the molten globule and the random coil. Five examples follow which suggest that native protein structure can correspond to any of the three states (not just the ordered state) and that protein function can arise from any of the three states and their transitions. 1. In a process that likely mimics infection, fd phage converts from the ordered into the disordered molten globular state. 2. Nucleosome hyperacetylation is crucial to DNA replication and transcription; this chemical modification greatly increases the net negative charge of the nucleosome core particle. We propose that the increased charge imbalance promotes its conversion to a much less rigid form. 3. Clusterin contains an ordered domain and also a native molten globular region. The molten globular domain likely functions as a proteinaceous detergent for cell remodeling and removal of apoptotic debris. 4. In a critical signaling event, a helix in calcineurin becomes bound and surrounded by calmodulin, thereby
A comparison of profile Hidden Markov Model procedures for remote homology detection
- Nucleic Acids Res
, 2002
Cited by 26 (3 self)
Profile hidden Markov models (HMMs) are amongst the most successful procedures for detecting remote homology between proteins. There are two popular profile HMM programs, HMMER and SAM. Little is known about their performance relative to each other and to the recently improved version of PSI-BLAST. Here we compare the two programs to each other and to non-HMM methods, to determine their relative performance and the features that are important for their success. The quality of the multiple sequence alignments used to build models was the most important factor affecting the overall performance of profile HMMs. The SAM T99 procedure is needed to produce high quality alignments automatically, and the lack of an equivalent component in HMMER makes it less complete as a package. Using the default options and parameters as would be expected of an inexpert user, it was found that from identical alignments SAM consistently produces better models than HMMER and that the relative performance of the model-scoring components varies. On average, HMMER was found to be between one and three times faster than SAM when searching databases larger than 2000 sequences, SAM being faster on smaller ones. Both methods were shown to have effective low complexity and repeat sequence masking using their null models, and the accuracy of their E-values was comparable. It was found that the SAM T99 iterative database search procedure performs better than the most recent version of PSI-BLAST, but that scoring of PSI-BLAST profiles is more than 30 times faster than scoring of SAM models.
The Utility of Different Representations of Protein Sequence for Predicting Functional Class
, 2001
Cited by 24 (4 self)
Motivation: Data Mining Prediction (DMP) is a novel approach to predict protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the E. coli genome as a model. Results: Using the different representations DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function: 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60%, and 5% of unass...
Including Biological Literature Improves Homology Search
- In Pacific Symposium on Biocomputing 2001. Mauna Lani
, 2001
Cited by 23 (2 self)
The sequence information generated by genome sequencing projects offers opportunities for understanding biology at an unprecedented fine level of detail. At the same time, the biomedical literature provides a record of high-level biological phenomena as observed and reported over many decades. There is an opportunity to combine the power of the genome sequence information with the published biological record to accelerate progress and gain insight. Here we show that including literature to tailor homology searches against sequence databases can improve performance. The concept of homology between two protein or nucleotide sequences is often used to infer that two genes or their protein products are related by evolution. Divergence between the two entities may have occurred when two species evolved from a single ancestor (orthologs) or when gene duplication occurs within a species (paralogs). We usually expect that homologous sequences have common functional roles

