Results 1  10
of
69
Hidden Markov models for detecting remote protein homologies
 Bioinformatics
, 1998
"... A new hidden Markov model method (SAMT98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAMT98 is ..."
Abstract

Cited by 308 (12 self)
 Add to MetaCart
A new hidden Markov model method (SAMT98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAMT98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAMT98 method with four datasets. Three of the test sets are foldrecognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against wublastp and against doubleblast, a twostep method similar to ISS, but using blast instead of fasta. Results SAMT98 had the fewest errors in all tests dramatically so for the foldrecognition tests. At the minimumerror point on the SCOPdomains test, SAMT98 got 880 true positives and 68 false positives, doubleblast got 533 true positives with 71 false positives, and wublastp got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to nd family or fold relationships. One key to the performance of the hmm method is a new scorenormalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability A World Wide Web server, as well as information on obtaining the Sequence Alignment and PREPRINT to appear in Bioinformatics, 1999
A Discriminative Framework for Detecting Remote Protein Homologies
, 1999
"... A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a ..."
Abstract

Cited by 193 (4 self)
 Add to MetaCart
A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a protein family, in this case a hidden Markov model. This general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
, 1996
"... This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein dat ..."
Abstract

Cited by 130 (22 self)
 Add to MetaCart
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
MetaMEME: motifbased hidden Markov models of protein families
 Comput Appl Biosci
, 1997
"... Motivation: Modeling families of related biological sequences using Hidden Markov models (HMMs), although increasingly widespread, faces at least one major problem: because of the complexity of these mathematical models, they require a relatively large training set in order to accurately recognize a ..."
Abstract

Cited by 72 (8 self)
 Add to MetaCart
Motivation: Modeling families of related biological sequences using Hidden Markov models (HMMs), although increasingly widespread, faces at least one major problem: because of the complexity of these mathematical models, they require a relatively large training set in order to accurately recognize a given family. For families in which there are few known sequences, a standard linear HMM contains too many parameters to be trained adequately. Results: This work attempts to solve that problem by generating smaller HMMs which precisely model only the conserved regions of the family. These HMMs are constructed from motif models generated by the EM algorithm using the MEME software. Because motifbased HMMs have relatively few parameters, they can be trained using smaller data sets. Studies of short chain alcohol dehydrogenases and 4Fe4S ferredoxins support the claim that motifbased HMMs exhibit increased sensitivity and selectivity in database searches, especially when training sets contain few sequences.
Improving Prediction of Protein Secondary Structure using Structured Neural Networks and Multiple Sequence Alignments
 J. Comput. Biol
, 1996
"... The prediction of protein secondary structure by use of carefully structured neural networks and multiple sequence alignments has been investigated. Separate networks are used for predicting the three secondary structures ffhelix, fistrand and coil. The networks are designed using a priori knowled ..."
Abstract

Cited by 61 (4 self)
 Add to MetaCart
The prediction of protein secondary structure by use of carefully structured neural networks and multiple sequence alignments has been investigated. Separate networks are used for predicting the three secondary structures ffhelix, fistrand and coil. The networks are designed using a priori knowledge of amino acid properties with respect to the secondary structure and of the characteristic periodicity in ffhelices. Since these singlestructure networks all have less than 600 adjustable weights overfitting is avoided. To obtain a threestate prediction of ffhelix, fistrand or coil, ensembles of singlestructure networks are combined with another neural network. This method gives an overall prediction accuracy of 66.3% when using sevenfold crossvalidation on a database of 126 nonhomologous globular proteins. Applying the method to multiple sequence alignments of homologous proteins increases the prediction accuracy significantly to 71.3% with corresponding Matthews' correlation c...
Predicting protein structure using hidden Markov models
, 1997
"... We discuss how methods based on hidden Markov models performed in the fold recognition section of the CASP2 experiment. Hidden Markov models were built for a set of about a thousand structures from the PDB database, and each CASP2 target sequence was scored against this library of hidden Markov mode ..."
Abstract

Cited by 58 (20 self)
 Add to MetaCart
We discuss how methods based on hidden Markov models performed in the fold recognition section of the CASP2 experiment. Hidden Markov models were built for a set of about a thousand structures from the PDB database, and each CASP2 target sequence was scored against this library of hidden Markov models. In addition, a hidden Markov model was built for each of the target sequences, and all of the sequences in PDB were scored against that target model. Having high scores from both methods was found to be highly indicative of the target and a structure being homologous. Predictions were made based on several criteria: the scores with the structure models, the scores with the target models, consistency between the secondary structure in the known structure and predictions for the target (using the program PhD), human examination of predicted alignments between target and structure (using RASMOL), and solvation preferences in the alignment of the target and structure. The method worked well in comparison to other methods used at CASP2 for targets of moderate difficulty, where the closest structure in PDB could be aligned to the target with at least 15 % residue identity. There was no evidence for the method's e ectiveness for harder cases, where the residue identity was much lower than 15%.
Scoring Hidden Markov Models
"... Motivation: Statistical sequence comparison techniques, such as hidden Markov models and generalized pro les, calculate the probability that a sequence was generated by a given model. Logodds scoring is a means of evaluating this probability by comparing it to a null hypothesis, usually a simpler s ..."
Abstract

Cited by 37 (5 self)
 Add to MetaCart
Motivation: Statistical sequence comparison techniques, such as hidden Markov models and generalized pro les, calculate the probability that a sequence was generated by a given model. Logodds scoring is a means of evaluating this probability by comparing it to a null hypothesis, usually a simpler statistical model intended to represent the universe of sequences as a whole, rather than the group of interest. Such scoring leads to two immediate questions: what should the null model be, and what threshold of logodds score should be deemed a match to the model. Results: This paper experimentally analyses these two issues. Within the context of the Sequence Alignment and Modeling software suite (SAM), we consider a variety ofnull models and suitable thresholds. Additionally, we consider HMMer's logodds scoring and SAM's original Zscoring method. Among the null model choices, a simple looping null model that emits characters according to the geometric mean of the character probabilities in the columns modeled by the HMM performs well or best across all four discrimination experiments.
Using substitution probabilities to improve positionspecific scoring matrices
 Computer Applications in the Biosciences
, 1996
"... blocks Subject classification: proteins *To whom reprint requests should be sent Running head: Improved positionspecific scoring matrices Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching appli ..."
Abstract

Cited by 37 (1 self)
 Add to MetaCart
blocks Subject classification: proteins *To whom reprint requests should be sent Running head: Improved positionspecific scoring matrices Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching applications, the count vector is an imperfect representation of a position, because the observed sequences are an incomplete sample of the full set of related sequences. One general solution to this problem is to model unobserved sequences by adding artificial "pseudocounts " to the observed counts. We introduce a simple method for computing pseudocounts that combines the diversity observed in each alignment position with amino acid substitution probabilities. In extensive empirical tests, this positionbased method outperformed other pseudocount methods and was a substantial improvement over the traditional average score method used for constructing profiles. 2
COACH: profileprofile alignment of protein families using hidden Markov models
 BIOINFORMATICS
, 2004
"... ..."
Unique and conserved features of genome and proteome of SARScoronavirus, an early splitoff from the coronavirus group 2 lineage
, 2003
"... *Corresponding authors The genome organization and expression strategy of the newly identified severe acute respiratory syndrome coronavirus (SARSCoV) were predicted using recently published genome sequences. Fourteen putative open reading frames were identified, 12 of which were predicted to be ex ..."
Abstract

Cited by 32 (6 self)
 Add to MetaCart
*Corresponding authors The genome organization and expression strategy of the newly identified severe acute respiratory syndrome coronavirus (SARSCoV) were predicted using recently published genome sequences. Fourteen putative open reading frames were identified, 12 of which were predicted to be expressed from a nested set of eight subgenomic mRNAs. The synthesis of these mRNAs in SARSCoVinfected cells was confirmed experimentally. The 4382 and 7073 amino acid residue SARSCoV replicase polyproteins are predicted to be cleaved into 16 subunits by two viral proteinases (bringing the total number of SARSCoV proteins to 28). A phylogenetic analysis of the replicase gene, using a distantly related torovirus as an outgroup, demonstrated that, despite a number of unique features, SARSCoV is most closely related to group 2 coronaviruses. Distant homologs of cellular RNA processing enzymes were identified in group 2 coronaviruses, with four of them being conserved in SARSCoV. These newly recognized viral enzymes place the mechanism of coronavirus RNA synthesis in a completely new perspective. Furthermore, together with previously described viral enzymes, they will be important targets for the design of antiviral strategies aimed at controlling the further spread of SARSCoV.