Results 1-10 of 299
Approximate Clustering without the Approximation
Cited by 35 (18 self)
Approximation algorithms for clustering points in metric spaces is a flourishing area of research, with much research effort spent on getting a better understanding of the approximation guarantees possible for many objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that these better approximations also give us more accurate clusterings. E.g., for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown “correct” target clustering and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close (in symmetric difference) to the truth. In this paper, we show that if we make this implicit assumption explicit (that is, if we assume that any c-approximation to the given clustering objective F is ε-close to the target), then we can produce clusterings that are O(ε)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for k-median and k-means objectives, we show that we can achieve this guarantee for any constant c > 1, and for the min-sum objective we can do this for any constant c > 2. Our results also highlight a somewhat surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target, and assuming that any approximately optimal solution is ε-close to the target, even for approximation factor say c = 1.01. In the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, and yet for the latter we have an efficient algorithm.
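The two quantities this abstract relates, the value of a clustering objective and the distance of a clustering from a target, can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the function names `kmedian_cost` and `clustering_distance`, the one-dimensional toy data, and the brute-force search over cluster relabelings (feasible only for small k) are all assumptions made for the example.

```python
from itertools import permutations

def kmedian_cost(points, centers, dist):
    """k-median objective: sum over points of the distance to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)

def clustering_distance(labels_a, labels_b, k):
    """Fraction of points on which two k-clusterings disagree, minimized over
    all relabelings of the clusters (brute force, so only sensible for small k)."""
    n = len(labels_a)
    best = n
    for perm in permutations(range(k)):
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if perm[a] != b)
        best = min(best, mismatches)
    return best / n

# Hypothetical toy instance: six points on a line, two obvious groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = lambda x, y: abs(x - y)

cost = kmedian_cost(points, [1.0, 11.0], dist)

target = [0, 0, 0, 1, 1, 1]     # assumed "correct" clustering
candidate = [1, 1, 0, 0, 0, 0]  # same groups up to renaming, one point misplaced
eps = clustering_distance(candidate, target, k=2)
```

Here `candidate` would be called ε-close to `target` for any ε of at least `eps`. The assumption studied in the abstract is that every clustering whose cost is within a factor c of the optimum has this property.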
A NEW GENERATION OF HOMOLOGY SEARCH TOOLS BASED ON PROBABILISTIC INFERENCE
2009
Cited by 33 (2 self)
Many theoretical advances have been made in applying probabilistic inference methods to improve the power of sequence homology searches, yet the BLAST suite of programs is still the workhorse for most of the field. The main reason for this is practical: BLAST’s programs are about 100-fold faster than the fastest competing implementations of probabilistic inference methods. I describe recent work on the HMMER software suite for protein sequence analysis, which implements probabilistic inference using profile hidden Markov models. Our aim in HMMER3 is to achieve BLAST’s speed while further improving the power of probabilistic inference-based methods. HMMER3 implements a new probabilistic model of local sequence alignment and a new heuristic acceleration algorithm. Combined with efficient vector-parallel implementations on modern processors, these improvements synergize. HMMER3 uses more powerful log-odds likelihood scores (scores summed over alignment uncertainty, rather than scoring a single optimal alignment); it calculates accurate expectation values (E-values) for those scores without simulation using a generalization of Karlin/Altschul theory; it computes posterior distributions over the ensemble of possible alignments and returns posterior probabilities (confidences) in each aligned residue; and it does all this at an overall speed comparable to BLAST. The HMMER project aims to usher in a new generation of more powerful homology search tools based on probabilistic inference methods.
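The distinction between a score summed over alignment uncertainty and the score of a single optimal alignment can be illustrated on a generic hidden Markov model. The sketch below is not HMMER's profile model or implementation: the two states, their parameters, the null model, and the function names are all hypothetical, chosen only to show the textbook forward (sum over paths) and Viterbi (best path) recursions side by side.

```python
import math

def forward_and_viterbi(obs, start, trans, emit):
    """Return (P_forward, P_viterbi): the probability of obs summed over all
    state paths, and the probability of obs along the single best path."""
    states = list(start)
    fwd = {s: start[s] * emit[s][obs[0]] for s in states}
    vit = dict(fwd)
    for x in obs[1:]:
        fwd = {t: emit[t][x] * sum(fwd[s] * trans[s][t] for s in states)
               for t in states}
        vit = {t: emit[t][x] * max(vit[s] * trans[s][t] for s in states)
               for t in states}
    return sum(fwd.values()), max(vit.values())

def log_odds(p_model, obs, null_emit):
    """Log2 odds of a model likelihood against an i.i.d. null model."""
    p_null = math.prod(null_emit[x] for x in obs)
    return math.log2(p_model / p_null)

# Hypothetical toy model with a "match-like" state M and an "insert-like" state I.
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.5, "I": 0.5}}
emit = {"M": {"A": 0.9, "C": 0.1}, "I": {"A": 0.5, "C": 0.5}}
null_emit = {"A": 0.5, "C": 0.5}

obs = "AAC"
p_fwd, p_vit = forward_and_viterbi(obs, start, trans, emit)
forward_score = log_odds(p_fwd, obs, null_emit)
viterbi_score = log_odds(p_vit, obs, null_emit)
```

Because the forward probability adds up every path, including the best one, `forward_score` can never fall below `viterbi_score`. The gap between them reflects how much probability mass lies on alternative alignments.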
DrugBank 3.0: a comprehensive resource for 'omics' research on drugs
Nucleic Acids Res 2011, Database Issue
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes
Nucleic Acids Res, 2009
Cited by 29 (5 self)
REBASE is a comprehensive database of information about restriction enzymes, DNA methyltransferases and related proteins involved in the biological process of restriction–modification (R–M). It contains fully referenced information about recognition and cleavage sites, isoschizomers, neoschizomers, commercial availability, methylation sensitivity, crystal and sequence data. Experimentally characterized homing endonucleases are also included. The fastest growing segment of REBASE contains the putative R–M systems found in the sequence databases. Comprehensive descriptions of the R–M content of all fully sequenced genomes are available including summary schematics. The contents of REBASE may be browsed from the web
The National Center for Biotechnology Information’s Protein Clusters Database
Nucleic Acids Res, 2009
Cited by 10 (2 self)
Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying microorganism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at
Gene3D: merging structure and function for a Thousand genomes
Nucleic Acids Res, 2010
Cited by 9 (4 self)
Over the last 2 years the Gene3D resource has been significantly improved, and is now more accurate and with a much richer interactive display via the Gene3D website
The integrated microbial genomes system: an expanding comparative analysis resource
Nucleic Acids Res, 2009
Efficient Clustering with Limited Distance Information
Cited by 7 (4 self)
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. We use our algorithm to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire dataset. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
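The one-versus-all query model can be sketched as follows. This is not the algorithm from the paper: it substitutes a simple farthest-first landmark heuristic, with no accuracy guarantee claimed, purely to show how a clustering can be produced from exactly k queries. The `OneVersusAllOracle` class, the `landmark_clustering` function, and the toy data are all hypothetical.

```python
class OneVersusAllOracle:
    """Hypothetical oracle: one query returns the distances from one point
    to every point in the set, and the number of queries is counted."""

    def __init__(self, points, dist):
        self.points = points
        self.dist = dist
        self.queries = 0

    def query(self, i):
        self.queries += 1
        return [self.dist(self.points[i], p) for p in self.points]

def landmark_clustering(oracle, n, k, first=0):
    """Pick k landmarks by farthest-first traversal, one query per landmark,
    and label every point with the index of its nearest landmark."""
    nearest = [float("inf")] * n  # distance to the closest landmark so far
    labels = [0] * n
    landmark = first
    for c in range(k):
        row = oracle.query(landmark)
        for j, d in enumerate(row):
            if d < nearest[j]:
                nearest[j], labels[j] = d, c
        # Next landmark: the point farthest from all landmarks chosen so far.
        landmark = max(range(n), key=nearest.__getitem__)
    return labels

# Toy instance: three well-separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0]
oracle = OneVersusAllOracle(points, lambda x, y: abs(x - y))
labels = landmark_clustering(oracle, n=len(points), k=3)
```

In the protein setting described above, a single `query` call would correspond to one run of a sequence database search program against the whole dataset, which is why the number of queries, rather than the number of individual distances, is the cost being counted.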
Hub Promiscuity in Protein-Protein Interaction Networks
2010
Cited by 7 (1 self)
Hubs are proteins with a large number of interactions in a protein-protein interaction network. They are the principal agents in the interaction network and affect its function and stability. Their specific recognition of many different protein partners is of great interest from the structural viewpoint. Over the last few years, the structural properties of hubs have been extensively studied. We review the currently known features that are particular to hubs, possibly affecting their binding ability. Specifically, we look at the levels of intrinsic disorder, surface charge and domain distribution in hubs, as compared to non-hubs, along with differences in their functional domains.