ProtoNet 4.0: a hierarchical classification of one million protein sequences
- Nucleic Acids Res
, 2005
Cited by 32 (11 self)
ProtoNet is an automatic hierarchical classification of the protein sequence space. In 2004, the ProtoNet (version 4.0) presents the analysis of over one million proteins merged from SwissProt and TrEMBL databases. In addition to rich visualization and analysis tools to navigate the clustering hierarchy, we incorporated several improvements that allow a simplified view of the scaffold of the proteins. An unsupervised, biologically valid method that was developed resulted in a condensation of the ProtoNet hierarchy to only 12% of the clusters. A large portion of these clusters was automatically assigned high confidence biological names according to their correspondence with functional annotations. ProtoNet is available at:
Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes
, 2006
Statistically Rigorous Automated Protein Annotation
- BIOINFORMATICS
, 2004
Cited by 9 (0 self)
Motivation: Assignment of putative protein functional annotation by comparative analysis using pre-defined experimental annotations is performed routinely by molecular biologists. The number and statistical significance of these assignments remains a challenge in this era of high-throughput proteomics. A combined statistical method that enables robust, automated protein annotation by reliably expanding existing annotation sets is described. An existing clustering scheme, based on relevant experimental information (e.g., sequence identity, keywords, or gene expression data) is required. The method assigns new proteins to these clusters with a measure of reliability. It can also provide human reviewers with a reliability score for both new and previously classified proteins.
eBLOCKs: Enumerating conserved protein blocks to achieve maximal sensitivity and specificity
- Nucl. Acids Res
, 2005
Michigan molecular interactions (MiMI): Putting the jigsaw puzzle together
- Nucleic Acids Research
, 2007
Cited by 7 (4 self)
Protein interaction data exists in a number of repositories. Each repository has its own data format, molecule identifier and supplementary information. Michigan Molecular Interactions (MiMI) assists scientists searching through this overwhelming amount of protein interaction data. MiMI gathers data from well-known protein interaction databases and deep-merges the information. Utilizing an identity function, molecules that may have different identifiers but represent the same real-world object are merged. Thus, MiMI allows the users to retrieve information from many different databases at once, highlighting complementary and contradictory information. To help scientists judge the usefulness of a piece of data, MiMI tracks the provenance of all data. Finally, a simple yet powerful user interface aids users in their queries, and frees them from the onerous task of knowing the data format or learning a query language. MiMI allows scientists to query all data, whether corroborative or contradictory, and specify which sources to utilize. MiMI is part of the
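The deep-merge idea in the MiMI abstract (an identity function that groups records, with provenance kept for every value) can be illustrated with a minimal Python sketch. This is not MiMI's implementation: the record layout, the `identity_key` rule, and the toy sources `db_a`, `db_b`, `db_c` are all assumptions made for illustration.

```python
from collections import defaultdict


def identity_key(record):
    """Hypothetical identity function: records describe the same molecule
    when their normalized symbol and taxon id match (an assumption)."""
    return (record["symbol"].strip().upper(), record["taxon"])


def deep_merge(records):
    """Group records by identity key; for each field keep every reported
    value together with the set of sources that reported it (provenance)."""
    merged = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for rec in records:
        key = identity_key(rec)
        for field, value in rec["attributes"].items():
            merged[key][field][value].add(rec["source"])
    return merged


def contradictions(merged):
    """List (key, field) pairs for which the sources report different values."""
    return sorted(
        (key, field)
        for key, fields in merged.items()
        for field, values in fields.items()
        if len(values) > 1
    )


# Toy records, invented for this sketch only.
records = [
    {"source": "db_a", "symbol": "abc1", "taxon": 1,
     "attributes": {"location": "nucleus"}},
    {"source": "db_b", "symbol": "ABC1 ", "taxon": 1,
     "attributes": {"location": "nucleus", "role": "kinase"}},
    {"source": "db_c", "symbol": "Abc1", "taxon": 1,
     "attributes": {"location": "membrane"}},
]
merged = deep_merge(records)
```

In this toy data all three records collapse into one molecule; `location` is corroborated by two sources and contradicted by a third, which `contradictions(merged)` reports.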
The homology kernel: a biologically motivated sequence embedding into Euclidean space
- UNIVERSITY OF CALIFORNIA, SAN DIEGO
, 2004
Cited by 6 (1 self)
Part of the challenge of modeling protein sequences is their discrete nature. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this paper, we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.
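The general idea of embedding discrete sequences into a Euclidean space can be illustrated with a much simpler stand-in, the k-mer spectrum kernel. The sketch below is not the homology kernel from the abstract (it ignores alignment, homologs, and secondary structure); it only shows how a sequence becomes a count vector whose inner product serves as a similarity. The function names and the default `k=3` are assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Embed a sequence as a sparse vector of k-mer counts."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def normalized_kernel(a, b, k=3):
    """Cosine-style normalization; returns 0.0 if either vector is empty."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0
```

Because the kernel is an explicit inner product, any method that works on points in Euclidean space (for example a support vector machine) can be applied to the embedded sequences.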
Family classification without domain chaining
Cited by 3 (0 self)
Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymy clustering algorithms.
Results: Here, we investigate a network-rewiring strategy designed to eliminate edges due to promiscuous domains. We show that this strategy can reduce noise in and restore structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluate this approach on a hand-curated set of multidomain sequences in mouse and human, and demonstrate that classification using the rewired network delivers dramatic improvement in Precision and Recall, compared with current methods. Families in our test set exhibit a broad range of domain architectures and sequence conservation, demonstrating that our method is flexible, robust and suitable for high-throughput, automated processing of heterogeneous, genome-scale data.
Contact:
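The domain-chaining problem described above can be made concrete with a toy Python sketch. The abstract does not specify the rewiring algorithm, so the filter below is only an assumed illustration: a domain counts as promiscuous when it co-occurs with more than `max_partners` distinct other domains, and an edge is dropped when every shared domain is promiscuous. The threshold, the function names, and the data are hypothetical.

```python
from collections import defaultdict


def build_edges(domains_by_seq):
    """Link two sequences whenever they share at least one domain."""
    seqs = sorted(domains_by_seq)
    edges = {}
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            shared = domains_by_seq[a] & domains_by_seq[b]
            if shared:
                edges[(a, b)] = shared
    return edges


def promiscuous_domains(domains_by_seq, max_partners=2):
    """Assumed rule: a domain is promiscuous if it co-occurs with more
    than max_partners distinct other domains."""
    partners = defaultdict(set)
    for doms in domains_by_seq.values():
        for d in doms:
            partners[d] |= doms - {d}
    return {d for d, p in partners.items() if len(p) > max_partners}


def rewire(edges, promiscuous):
    """Drop edges supported only by promiscuous domains."""
    return {e: s for e, s in edges.items() if s - promiscuous}


def families(nodes, edges):
    """Connected components of the network, as sorted lists."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, out = set(), []
    for n in sorted(nodes):
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        out.append(sorted(comp))
    return out


# Toy architectures, invented for this sketch: P is shared across families.
toy = {"s1": {"A", "P"}, "s2": {"A"}, "s3": {"B", "P"},
       "s4": {"B"}, "s5": {"C", "P"}}
raw_edges = build_edges(toy)
clean_edges = rewire(raw_edges, promiscuous_domains(toy))
before = families(toy, raw_edges)
after = families(toy, clean_edges)
```

In the toy network the shared domain `P` chains all five sequences into one component; after the filter, the components separate into three groups.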