Results 1–10 of 53
Combining phylogenetic and hidden Markov models in biosequence analysis
J. Comput. Biol., 2004
Cited by 103 (12 self)
A few models have appeared in recent years that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way the process changes from one site to the next. These models combine phylogenetic models of molecular evolution, which apply to individual sites, and hidden Markov models, which allow for changes from site to site. Besides improving the realism of ordinary phylogenetic models, they are potentially very powerful tools for inference and prediction (for gene finding, for example, or prediction of secondary structure). In this paper, we review progress on combined phylogenetic and hidden Markov models and present some extensions to previous work. Our main result is a simple and efficient method for accommodating higher-order states in the HMM, which allows for context-sensitive models of substitution, that is, models that consider the effects of neighboring bases on the pattern of substitution. We present experimental results indicating that higher-order states, autocorrelated rates, and multiple functional categories all lead to significant improvements in the fit of a combined phylogenetic and hidden Markov model, with the effect of higher-order states being particularly pronounced.
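The combination described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Jukes-Cantor substitution model, a tiny hand-written tree, and hidden states that are simply rate categories (the "autocorrelated rates" case); higher-order states are not covered. All function names and the tree encoding are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(same, t, rate):
    """Jukes-Cantor transition probability over branch length t, scaled by rate."""
    e = math.exp(-4.0 * rate * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def column_likelihood(tree, column, rate):
    """Felsenstein pruning for one alignment column.

    tree is a leaf name (str) or a tuple (left, t_left, right, t_right);
    column maps each leaf name to its observed base.
    """
    def partial(node):
        if isinstance(node, str):
            return [1.0 if b == column[node] else 0.0 for b in BASES]
        left, t_left, right, t_right = node
        p_left, p_right = partial(left), partial(right)
        out = []
        for a in range(4):
            s_left = sum(jc_prob(a == b, t_left, rate) * p_left[b] for b in range(4))
            s_right = sum(jc_prob(a == b, t_right, rate) * p_right[b] for b in range(4))
            out.append(s_left * s_right)
        return out
    # Uniform equilibrium frequencies at the root (Jukes-Cantor assumption).
    return sum(0.25 * p for p in partial(tree))

def forward_likelihood(tree, columns, rates, init, trans):
    """Forward algorithm along the sequence; hidden state s uses rates[s]."""
    n = len(rates)
    f = [init[s] * column_likelihood(tree, columns[0], rates[s]) for s in range(n)]
    for col in columns[1:]:
        emit = [column_likelihood(tree, col, rates[s]) for s in range(n)]
        f = [emit[s] * sum(f[r] * trans[r][s] for r in range(n)) for s in range(n)]
    return sum(f)

# Illustrative data: two leaves, a slow and a fast rate category.
tree = ("x", 0.1, "y", 0.2)
columns = [{"x": "A", "y": "A"}, {"x": "A", "y": "C"}, {"x": "G", "y": "G"}]
likelihood = forward_likelihood(
    tree, columns, rates=[0.5, 2.0], init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
)
print(likelihood)
```

The phylogenetic model supplies the emission probability of each column, and the HMM couples neighboring columns through the transition matrix. For long alignments a practical version would work in log space or rescale the forward vector to avoid underflow.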
A benchmark of multiple sequence alignment programs upon structural RNAs
Nucleic Acids Res, 2005
Cited by 91 (12 self)
To date, few attempts have been made to benchmark the alignment algorithms upon nucleic acid sequences. Frequently, sophisticated PAM- or BLOSUM-like models are used to align proteins, yet equivalents are not considered for nucleic acids; instead, rather ad hoc models are generally favoured. Here, we systematically test the performance of existing alignment algorithms on structural RNAs. This work was aimed at achieving the following goals: (i) to determine conditions where it is appropriate to apply common sequence alignment methods to the structural RNA alignment problem. This indicates where and when researchers should consider augmenting the alignment process with auxiliary information, such as secondary structure, and (ii) to determine which sequence alignment algorithms perform well under the broadest range of conditions. We find that sequence alignment alone, using the current algorithms, is generally inappropriate below 50–60% sequence identity. Second, we note that the probabilistic method ProAlign and the aging Clustal algorithms generally outperform other sequence-based algorithms, under the broadest range of applications.
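Benchmarks of this kind compare a program's alignment against a trusted reference. One common way to do that is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The abstract does not say which measure the paper uses, so the sketch below is only an illustration of the general idea; the function names and the toy sequences are assumptions.

```python
def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    alignment is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified by (sequence index, ungapped position).
    """
    pairs = set()
    counters = [0] * len(alignment)
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs recovered by the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)

# Toy example: the test alignment misplaces the final G of the first sequence.
reference = ["AC-G", "ACUG"]
test = ["ACG-", "ACUG"]
score = sum_of_pairs_score(test, reference)
print(score)
```

Here the reference aligns three residue pairs and the test alignment recovers two of them.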
BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark
Proteins, 2005
Cited by 78 (1 self)
Multiple sequence alignment is one of the cornerstones of modern molecular biology. It is used to identify conserved motifs, to determine protein domains, in 2D/3D structure prediction by homology and in evolutionary studies. Recently, high-throughput technologies such as genome sequencing and structural proteomics have led to an explosion in the amount of sequence and structure information available. In response, several new multiple alignment methods have been developed that improve both the efficiency and the quality of protein alignments. Consequently, the benchmarks used to evaluate and compare these methods must also evolve. We present here the latest release of the most widely used multiple alignment benchmark, BAliBASE, which provides high quality, manually refined, reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences. Using a novel, semi-automatic update protocol, the number of protein families in the benchmark has been increased and representative test cases are now available that cover most of the protein fold space. The total number of proteins in BAliBASE has also been significantly increased from 1444 to 6255 sequences. In addition, full-length sequences are now provided for all test cases, which represent difficult cases for both global and local alignment programs. Finally, the BAliBASE Web site
GM: An expectation maximization algorithm for training hidden substitution models
J Mol Bio
Cited by 38 (7 self)
We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. Measuring the accuracy of multiple alignment algorithms with reference to BAliBASE (a database of structural reference alignments), our substitution matrices consistently outperform the PAM series, with the improvement steadily increasing as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.
A hidden Markov model for progressive multiple alignment
Bioinformatics, 2003
Cited by 26 (2 self)
Motivation: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. Results: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus progressively constructing a multiple alignment. Our method allows for the computation of each column's minimum posterior probability, and we show that this value correlates with the correctness of the result, hence providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment. Availability: The software is freely available at
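The filtering step mentioned in this abstract can be sketched as follows, assuming the per-column minimum posterior probabilities have already been computed by some alignment program. The function name, the input layout, and the 0.5 threshold are illustrative assumptions, not part of the described software.

```python
def filter_columns(alignment, min_posteriors, threshold=0.5):
    """Keep only the columns whose minimum posterior probability reaches threshold.

    alignment is a list of equal-length gapped strings; min_posteriors holds
    one value per column. The threshold is an arbitrary illustrative cutoff.
    """
    if any(len(row) != len(min_posteriors) for row in alignment):
        raise ValueError("every row must have one character per posterior value")
    keep = [i for i, p in enumerate(min_posteriors) if p >= threshold]
    return ["".join(row[i] for i in keep) for row in alignment]

# Toy example: the second column is poorly supported and gets dropped.
alignment = ["ACGT", "A-GT"]
min_posteriors = [0.9, 0.2, 0.8, 0.95]
filtered = filter_columns(alignment, min_posteriors)
print(filtered)
```

Raising the threshold trades alignment length for reliability, so a sensible cutoff depends on the downstream analysis.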
Multiple Sequence Alignment Accuracy and Phylogenetic Inference
Cited by 25 (1 self)
Phylogenies are often thought to be more dependent upon the specifics of the sequence alignment rather than on the method of reconstruction. Simulation of sequences containing insertion and deletion events was performed in order to determine the role that alignment accuracy plays during phylogenetic inference. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (ultrametric equal branch length, ultrametric random branch length, nonultrametric random branch length). Comparisons between hypothesized alignments and true alignments enabled determination of two measures of alignment accuracy, that of the total data set and that of individual branches. In general, our results indicate that as alignment error increases, topological accuracy decreases. This trend was much more pronounced for data sets derived from more pectinate topologies. In contrast, for balanced, ultrametric, equal branch length tree shapes, alignment inaccuracy had little average effect on tree reconstruction. These conclusions are based on average trends of many analyses under different conditions, and any one specific analysis, independent of the alignment accuracy, may recover very accurate or inaccurate topologies. Maximum likelihood and Bayesian methods, in general, outperformed neighbor joining and maximum parsimony in terms of tree reconstruction accuracy. Results also indicated that as the length of the branch and of the neighboring branches increase, alignment accuracy decreases, and the length of the neighboring branches is the major factor in topological accuracy. Thus, multiple sequence alignment can be an important factor in downstream effects on topological reconstruction. [Bayesian; maximum likelihood; maximum parsimony; multiple sequence alignment; neighbor
Statistical alignment based on fragment insertion and deletion models
Bioinformatics 4:490–499
Cited by 25 (3 self)
Motivation: The topic of this paper is the estimation of alignments and mutation rates based on stochastic sequence–evolution models that allow insertions and deletions of subsequences (‘fragments’) and not just single bases. The model we propose is a variant of a model introduced by Thorne et al. (J. Mol. Evol., 34, 3–16, 1992). The computational tractability of the model depends on certain restrictions in the insertion/deletion process; we discuss the possible effects of these restrictions. Results: The process of fragment insertion and deletion in the sequence–evolution model induces a hidden Markov structure at the level of alignments and thus makes possible efficient statistical alignment algorithms. As an example we apply a sampling procedure to assess the variability in alignment and mutation parameter estimates for HVR1 sequences of human and orangutan, improving results of previous work. Simulation studies give evidence that estimation methods based on the proposed model also give satisfactory results when applied to data for which the restrictions in the insertion/deletion process do not hold. Availability: The source code of the software for sampling alignments and mutation rates for a pair of DNA sequences according to the fragment insertion and deletion model is freely available from
Phylogenetic hidden Markov models
In Statistical Methods in Molecular Evolution, 2005
Cited by 25 (6 self)
Phylogenetic hidden Markov models, or phylo-HMMs, are probabilistic models that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way this process changes from one site to the next. By treating molecular evolution as a combination of two Markov processes, one that operates in the dimension of space (along a genome) and one that operates in the dimension of time (along the branches of a phylogenetic tree), these models allow aspects of both sequence structure and sequence evolution to be captured. Moreover, as we will discuss, they permit key computations to be performed exactly and efficiently. Phylo-HMMs allow evolutionary information to be brought to bear on a wide variety of problems of sequence “segmentation,” such as gene prediction and the identification of conserved elements. Phylo-HMMs were first proposed as a way of improving phylogenetic models that allow for variation among sites in the rate of substitution [8, 52]. Soon afterward, they were adapted for the problem of secondary structure
An Efficient Algorithm for Statistical Multiple Alignment on Arbitrary Phylogenetic Trees
2003
Cited by 24 (6 self)
We present an efficient algorithm for statistical multiple alignment based on the TKF91 model of Thorne, Kishino, and Felsenstein (1991) on an arbitrary k-leaved phylogenetic tree. The existing algorithms use a hidden Markov model approach, which requires at least $O(\sqrt{5^k})$ states and leads to a time complexity of $O(5^k L^k)$, where $L$ is the geometric mean sequence length. Using a combinatorial technique reminiscent of inclusion/exclusion, we are able to sum away the states, thus improving the time complexity to $O(2^k L^k)$ and considerably reducing memory requirements. This makes statistical multiple alignment under the TKF91 model a definite practical possibility in the case of a phylogenetic tree with a modest number of leaves.
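To see what the quoted complexity bounds imply, the short sketch below evaluates the leading terms $5^k L^k$ and $2^k L^k$ for a few values of k. It ignores constant factors and lower-order terms, so the numbers are only indicative; the function names are made up for this illustration.

```python
from fractions import Fraction

def hmm_cost(k, length):
    """Leading term of the HMM-based approach: 5^k * L^k."""
    return 5 ** k * length ** k

def combinatorial_cost(k, length):
    """Leading term of the improved approach: 2^k * L^k."""
    return 2 ** k * length ** k

def speedup(k):
    """Ratio of the two leading terms, (5/2)^k, which does not depend on L."""
    return Fraction(5, 2) ** k

for k in (2, 3, 4, 5):
    print(k, hmm_cost(k, 100), combinatorial_cost(k, 100), float(speedup(k)))
```

The ratio grows as $(5/2)^k$, but both bounds still carry the factor $L^k$, which is why the abstract limits its claim to trees with a modest number of leaves.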
Using evolutionary expectation maximization to estimate indel rates
Bioinformatics 21, 2005
Cited by 13 (0 self)
Motivation: The Expectation Maximization (EM) algorithm, in the form of the Baum–Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the EM algorithm to estimate not only the probability parameters of the stochastic grammar, but also the instantaneous mutation rates of the underlying evolutionary model (to facilitate the development of stochastic grammars based on phylogenetic trees, also known as Statistical Alignment). Recently, we showed how to do this for the point substitution component of the evolutionary process; here, we extend these results to the indel process. Results: We present an algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the ‘TKF91’ model). The algorithm converges extremely rapidly, gives accurate results on simulated data that are an improvement over parsimonious estimates (which are shown to underestimate the true indel rate), and gives plausible results on experimental data (coronavirus envelope domains). Owing to the algorithm’s close similarity to the Baum–Welch algorithm for training hidden Markov models, it can be used in an ‘unsupervised’ fashion to estimate rates for unaligned sequences, or estimate several sets of rates for sequences with heterogeneous rates. Availability: Software implementing the algorithm and the benchmark is available under GPL from