Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census
- Proteins
, 1998
Cited by 38 (27 self)
Eight microbial genomes are compared in terms of protein structure. Specifically, yeast, H. influenzae, M. genitalium, M. jannaschii, Synechocystis, M. pneumoniae, H. pylori, and E. coli are compared in terms of patterns of fold usage, that is, whether a given fold occurs in a particular organism. Of the ~340 soluble protein folds currently in the structure databank (PDB), 240 occur in at least one of the eight genomes, and 30 are shared amongst all eight. The shared folds are depleted in all-helical structure and enriched in mixed helix-sheet structure compared to the folds in the PDB. The top-10 most common of the shared 30 are enriched in superfolds, uniting many nonhomologous sequence families, and are especially similar in overall architecture: eight have helices packed onto a central sheet. They are also very different from the common folds in the PDB, highlighting databank biases. Folds can be ranked in terms of expression as well as genome duplication. In yeast the top-10 most highly ex...
The Utility of Different Representations of Protein Sequence for Predicting Functional Class
, 2001
Cited by 24 (4 self)
Motivation: Data Mining Prediction (DMP) is a novel approach to predict protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the E. coli genome as a model. Results: Using the different representations, DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function; 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60%, and 5% of unass...
Evolutionary analysis by whole-genome comparisons
- Journal of Bacteriology
, 2002
Cited by 14 (1 self)
A total of 37 complete genome sequences of bacteria, archaea, and eukaryotes were compared. The percentage of orthologous genes of each species contained within any of the other 36 genomes was established. In addition, the mean identity of the orthologs was calculated. Several conclusions result: (i) a greater absolute number of orthologs of a given species is found in larger species than in smaller ones; (ii) a greater percentage of the orthologous genes of smaller genomes is contained in other species than is the case for larger genomes, which corresponds to a larger proportion of essential genes; (iii) before species can be specifically related to one another in terms of gene content, it is first necessary to correct for the size of the genome; (iv) eukaryotes have a significantly smaller percentage of bacterial orthologs after correction for genome size, which is consistent with their placement in a separate domain; (v) the archaebacteria are specifically related to one another but are not significantly different in gene content from the bacteria as a whole; (vi) determination of the mean identity of all orthologs (involving hundreds of gene comparisons per genome pair) reduces the impact of errors due to misidentification of orthologs and to misalignments, and thus it is far more reliable than single gene comparisons; (vii) however, there is a maximum amount of change in protein sequences of 37% mean identity, which limits the use of percentage sequence identity to the lower taxa, a result which should also be true for single gene comparisons of both proteins and rRNA; (viii) most of the species that appear to be
Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model
- MOLECULAR BIOLOGY AND EVOLUTION
, 2009
Cited by 5 (1 self)
Homologous genes originate from a common ancestor through vertical inheritance, duplication, or horizontal gene transfer. Entire homolog families spawned by a single ancestral gene can be identified across multiple genomes based on protein sequence similarity. The sequences, however, do not always reveal conclusively the history of large families. To study the evolution of complete gene repertoires, we propose here a mathematical framework that does not rely on resolved gene family histories. We show that so-called phylogenetic profiles, formed by family sizes across multiple genomes, are sufficient to infer principal evolutionary trends. The main novelty in our approach is an efficient algorithm to compute the likelihood of a phylogenetic profile in a model of birth-and-death processes acting on a phylogeny.
We examine known gene families in 28 archaeal genomes using a probabilistic model that involves lineage- and family-specific components of gene acquisition, duplication, and loss. The model enables us to consider all possible histories when inferring statistics about archaeal evolution. According to our reconstruction, most lineages are characterized by a net loss of gene families. Major increases in gene repertoire have occurred only a few times. Our reconstruction underlines the importance of persistent streamlining processes in shaping genome composition in Archaea. It also suggests that early archaeal genomes were as complex as typical modern ones, and even show signs, in the case of the methanogenic ancestor, of an extremely large gene repertoire.
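As a rough illustration of the "phylogenetic profile" described in the abstract above (family sizes across multiple genomes), the following minimal Python sketch builds such profiles from toy data. The function name, genome labels, and family labels are hypothetical and chosen only for illustration; the paper's likelihood algorithm for the birth-and-death model is not reproduced here.

```python
from collections import Counter


def phylogenetic_profiles(genome_to_families, genome_order):
    """Return {family: tuple of family sizes, one entry per genome in genome_order}.

    genome_to_families maps a genome label to a list with one family label
    per gene, so a family that appears twice has size 2 in that genome.
    """
    counts = {g: Counter(fams) for g, fams in genome_to_families.items()}
    families = sorted({f for c in counts.values() for f in c})
    # Counter returns 0 for families that are absent from a genome.
    return {f: tuple(counts[g][f] for g in genome_order) for f in families}


# Toy, made-up input: three genomes and three gene families (all hypothetical).
genomes = {
    "genome_A": ["fam1", "fam1", "fam2"],
    "genome_B": ["fam1", "fam3"],
    "genome_C": ["fam2", "fam2", "fam2"],
}
order = ["genome_A", "genome_B", "genome_C"]
profiles = phylogenetic_profiles(genomes, order)

for family, profile in profiles.items():
    print(family, profile)
```

Each profile is a vector of non-negative integers in a fixed genome order, which is the kind of input a profile-based model would take in place of resolved gene family histories.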
Assessing performance of orthology detection strategies applied to eukaryotic genomes
- PLoS One
, 2007
Cited by 5 (0 self)
Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies on comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of performance, due to the lack of a genomic-scale ‘gold standard’ orthology dataset. Even in the absence of such datasets, the comparison of results from alternative methodologies contains useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to reasonably infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity > 80%: INPARANOID identifies orthologs across two species while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than the (manually curated) KOG database, and the homolog clustering algorithm TribeMCL as well. By way of using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus
Mathematical modeling for functional divergence after gene duplication
- J Comput Biol
Cited by 4 (0 self)
In this paper, I present a statistical framework for modeling the functional divergence after gene duplication. A rate-component model to describe the rate covariation among homologous genes of a gene family is implemented when a phylogenetic tree is known. The Markov chain model is rigorous but may require a huge amount of computational time when the number of sequences is large. On the other hand, the Poisson-based model is mathematically analytical so that computation is very fast even for a large dataset. Moreover, under the posterior framework, we have developed a site-specific profile for predicting important amino acid residues responsible for these functional differences between member genes of a gene family. Our study may have great potential for functional genomics because it is cost-effective, and these predictions can be further tested by biological experimentation. Key words: functional divergence, gene duplication, Markov chain model, Poisson-based model, posterior prediction.
An Infrastructure for Comparative Genomics to Functionally Characterize Genes and Proteins
- Genome Inform. Ser. Workshop Genome Inform
, 2000
Cited by 2 (0 self)
Current genome projects are resulting in a flood of sequence data. The interpretation of these sequences is lagging, and optimized data analysis systems need to be developed. Much can be learned from comparing different genomes, as genomes of distant organisms may still encode proteins with high sequence similarity. The order of genes (colinearity) in genomes may also be conserved to some extent. We have employed both these observations to create a multi-functional, computational analysis system (genomeSCOUT™), which allows for rapid identification and functional characterization of genes and proteins through genome comparison. With a number of independent algorithms, information about different levels of protein homology (concerning e.g. paralogs, orthologs and clusters of orthologous groups, COGs) and gene order is collected and stored in several value-added databases. These databases are then used for interactive comparison of genomes and subsequent analysis. The application is based on the well-established data integration system SRS. This ensures (1) fast handling of large genomic data sets, (2) straightforward access to a multitude of biological databases, (3) unique linking functions between these databases, (4) highly efficient collection of information on genes and proteins and (5) fully integrated and user-friendly graphical representations of search results. This application can be used for projects as diverse as the correct annotation of genomes, the optimization of (micro)organisms for industrial production, or the identification of drug targets [22]. Keywords: genome comparison, protein homologs, orthologs, COGs
New Enzymes from Environmental Cassette Arrays: Functional attributes of a phosphotransferase and an RNA-methyltransferase
, 2004
By targeting gene cassettes by polymerase chain reaction (PCR) directly from environmentally derived DNA, we are able to amplify entire open reading frames (ORFs) independently of prior sequence knowledge.

