## Markovian Structures in Biological Sequence Alignments (1999)

Venue: Journal of the American Statistical Association

Citations: 20 (7 self)

### BibTeX

@ARTICLE{Liu99markovianstructures,

author = {Jun S. Liu and Andrew F. Neuwald and Charles E. Lawrence},

title = {Markovian Structures in Biological Sequence Alignments},

journal = {Journal of the American Statistical Association},

year = {1999},

volume = {94},

pages = {1--15}

}

### Abstract

In this article, we provide a coherent view of the two recent models used for multiple sequence alignment, the hidden Markov model (HMM) and the block-based motif model, in order to develop a set of new algorithms that enjoy both the sensitivity of the block-based model and the flexibility of the HMM. In particular, we decompose the standard HMM into two components: the insertion component, which is captured by the so-called "propagation model," and the deletion component, which is described by a deletion vector. Such a decomposition serves as a basis for rational compromise between biological specificity and model flexibility. Furthermore, we introduce a Bayesian model selection criterion that, in combination with the propagation model, genetic algorithm, and other computational aspects, forms the core of PROBE, a multiple alignment and database search methodology (software available via anonymous ftp at ftp://ncbi.nlm.nih.gov/pub/neuwald/probe1.0). The application of our method to a GTPase family of protein sequences yields an alignment that is confirmed by comparison with known tertiary structures.

### Citations

8563 | Elements of Information Theory - Cover, Thomas - 1991 |
Citation Context: ...$\log p(A) - \log p(A \mid R)$. Upper and lower bounds for $\log p(R)$ based on logMAP are as follows: $\text{logMAP} \le \log p(R) \le \text{logMAP} - E_{A \mid R}\{\log p(A \mid R)\}$. (9) Furthermore, by the information inequality (Cover and Thomas 1991) that for any non-degenerate distribution $q(A)$, $E_{A \mid R}\{\log p(A \mid R)\} \ge E_{A \mid R}\{\log q(A)\}$; the second inequality of (9) can be replaced by $\log p(R) \le \text{logMAP} - E_{A \mid R}\{\log q(A)\}$ and can be esti...
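
The bounds quoted in this citation context can be checked numerically on a toy example. The following is only a minimal sketch: it assumes that logMAP denotes the maximum over alignments A of log p(R, A), which the excerpt itself does not define, and the joint probabilities and the name `evidence_bounds` are hypothetical illustrations, not values or code from the paper.

```python
import math

# Hypothetical toy joint probabilities p(R, A) for three candidate alignments A
# of one fixed data set R (illustrative values, not taken from the paper).
joint = {"A1": 0.020, "A2": 0.005, "A3": 0.001}

def evidence_bounds(joint):
    """Return (lower, log p(R), upper) for the evidence of the toy model."""
    p_r = sum(joint.values())                       # p(R) = sum over A of p(R, A)
    # Assumption: logMAP is the log joint probability of the best alignment.
    log_map = math.log(max(joint.values()))
    posterior = [v / p_r for v in joint.values()]   # p(A | R)
    # E_{A|R}{log p(A | R)}, i.e. the negative posterior entropy.
    expected_log_post = sum(p * math.log(p) for p in posterior)
    return log_map, math.log(p_r), log_map - expected_log_post

lower, log_evidence, upper = evidence_bounds(joint)

# Looser upper bound from the information inequality with a uniform q(A):
# E_{A|R}{log q(A)} = log(1/3) for every posterior, so the bound is logMAP + log 3.
upper_q = lower - math.log(1.0 / len(joint))

print(lower, log_evidence, upper, upper_q)
```

Replacing the posterior with any other distribution q(A) can only loosen the upper bound, which is why the uniform choice above yields a value at least as large as the exact one.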

3779 | Basic local alignment search tool - Altschul, Gish, et al. - 1990 |

2468 | CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680 - Thompson, Higgins, et al. - 1994 |

1461 | Identification of common molecular subsequences - Smith, Waterman - 1981 |
Citation Context: ...ever algorithms with favorable time and space complexity for solving the combinatoric optimization problem associated with pairwise sequence alignments have been developed (Needleman and Wunsch 1970; Smith and Waterman 1981). These alignment algorithms provide for flexibility in alignment by permitting insertion/deletions between all residues of a sequence. However, a large number of parameters and an associated loss of...

1410 | A general method applicable to the search for similarities in the amino acid sequence of two proteins - Needleman, Wunsch - 1970 |

1047 | Improved tools for biological sequence comparison - Pearson, Lipman - 1988 |

983 | Bayes factors - Kass, Raftery - 1995 |

506 | Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment - Lawrence, Altschul, et al. - 1993 |
Citation Context: ...statistical models for multiple alignment have recently been developed: the block-motif model that describes conserved regions in protein or DNA sequences as ungapped blocks (Lawrence and Reilly 1990; Lawrence et al. 1993; Liu 1994; Liu et al. 1995; Neuwald, Liu and Lawrence 1995); and the hidden Markov model (HMM) that treats the observed sequences as though they were generated by a hypothetical ancestral model via m...

364 | A Tutorial on - Rabiner - 1989 |
Citation Context: ...Alignment The HMM, as initially introduced in the late 1960s, is a powerful statistical modeling tool and has been widely applied in signal processing, speech recognition, time series analysis, etc. (Rabiner 1989). The method was first applied to model biological sequences by Churchill (1989) and has become very popular recently in multiple sequence alignment (Baldi et al. 1994; Krogh et al. 1994; Lazareva an...

327 | Sequence logos: a new way to display consensus sequences. Nucleic Acids Res - Schneider, Stephens - 1990 |

265 | Introduction to computational biology - Waterman - 2000 |

168 | Posterior predictive assessment of model fitness via realized discrepancies - Gelman, Meng, et al. - 1996 |

153 | Hidden Markov models of biological primary sequence information - Baldi, Chauvin, et al. - 1994 |
Citation Context: ...gnition, time series analysis, etc. (Rabiner 1989). The method was first applied to model biological sequences by Churchill (1989) and has become very popular recently in multiple sequence alignment (Baldi et al. 1994; Krogh et al. 1994; Lazareva and Churchill 1997). The basic form of an HMM can be written as $y_t \sim f_t(y \mid h_t)$ (2) and $h_t \sim g_t(h \mid h_{t-1})$ (3), where $f_t$ and $g_t$ are probability distributions...
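
The basic HMM form quoted in this citation context (an observation drawn given the current hidden state, and a hidden state drawn given the previous one) can be sketched in a few lines. This is an illustrative sketch only: it assumes discrete, time-homogeneous distributions, whereas the quoted form allows them to vary with t, and the state names, symbols, probabilities, and function names are all hypothetical. The forward recursion shown is the standard way to sum over hidden paths, not the alignment algorithm of the paper.

```python
import itertools
import random

# Hypothetical two-state, two-symbol parameters (illustrative values only).
INIT = {"s0": 0.6, "s1": 0.4}                                        # p(h_1)
TRANS = {"s0": {"s0": 0.7, "s1": 0.3}, "s1": {"s0": 0.2, "s1": 0.8}}  # g(h | h_prev)
EMIT = {"s0": {"A": 0.9, "C": 0.1}, "s1": {"A": 0.3, "C": 0.7}}       # f(y | h)

def draw(dist, rng):
    """Draw one key of `dist` with probability equal to its value."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def sample_hmm(length, rng):
    """Generate hidden states h_t ~ g(h | h_{t-1}) and observations y_t ~ f(y | h_t)."""
    hidden, observed = [], []
    state_dist = INIT
    for _ in range(length):
        h = draw(state_dist, rng)
        hidden.append(h)
        observed.append(draw(EMIT[h], rng))
        state_dist = TRANS[h]          # next state depends only on the current one
    return hidden, observed

def forward_likelihood(observed):
    """Compute p(y_1, ..., y_T) by summing over all hidden paths (forward recursion)."""
    alpha = {h: INIT[h] * EMIT[h][observed[0]] for h in INIT}
    for y in observed[1:]:
        alpha = {h: EMIT[h][y] * sum(alpha[k] * TRANS[k][h] for k in alpha)
                 for h in INIT}
    return sum(alpha.values())

hidden, observed = sample_hmm(5, random.Random(0))

# Sanity check: the likelihoods of all length-3 observation sequences sum to one.
total = sum(forward_likelihood(seq) for seq in itertools.product("AC", repeat=3))
print(observed, forward_likelihood(observed), total)
```

The final check is a cheap way to confirm that the recursion defines a proper probability distribution over observation sequences of a fixed length.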

145 | Multiple alignment using hidden Markov models - Eddy - 1995 |
Citation Context: ...995); and the hidden Markov model (HMM) that treats the observed sequences as though they were generated by a hypothetical ancestral model via mutation (Baldi, Chauvin, McClure, and Hunkapiller 1994; Eddy 1995; Krogh, Brown, Mian, Sjolander, and Haussler 1994). By using a model similar to the HMM to describe how two sequences relate to each other, Allison and Wallace (1994) presented useful algorithms for ...

124 | Pfam: a comprehensive database of protein domain families based on seed alignments - Sonnhammer, Eddy, et al. - 1997 |

115 | Stochastic models for heterogeneous DNA sequences - Churchill - 1989 |

113 | Bayesianly justifiable and relevant frequency calculations for the applied statistician - Rubin, B - 1984 |

112 | Bayesian models for multiple local sequence alignment and Gibbs sampling strategies - Liu, Neuwald, et al. - 1995 |
Citation Context: ...alignment have recently been developed: the block-motif model that describes conserved regions in protein or DNA sequences as ungapped blocks (Lawrence and Reilly 1990; Lawrence et al. 1993; Liu 1994; Liu et al. 1995; Neuwald, Liu and Lawrence 1995); and the hidden Markov model (HMM) that treats the observed sequences as though they were generated by a hypothetical ancestral model via mutation (Baldi, Chauvin, Mc...

104 | Gibbs motif sampling: detection of bacterial outer membrane protein repeats - Neuwald, Liu, et al. - 1995 |
Citation Context: ...he algorithm. Block motif-based Gibbs sampling strategies have the ability to align subtly related sequences even when the number of sequences available for analysis is limited (Lawrence et al. 1993; Neuwald et al. 1995). They achieve this added sensitivity by employing two basic characteristics of functionally related proteins: 1) Point mutations and recombinations tend to be limited in functionally or structurally...

104 | An evolutionary model for maximum likelihood alignment of DNA sequences - Thorne, Kishino, et al. - 1991 |

97 | Automated assembly of protein blocks for database searching - Henikoff, Henikoff - 1991 |
Citation Context: ...djusting for the large number of multiple comparisons, the comparison scores obtained by chance from random sequences are creeping into the range of the comparison scores for truly related sequences (Henikoff and Henikoff 1991; Claverie 1996). Two statistical models for multiple alignment have recently been developed: the block-motif model that describes conserved regions in protein or DNA sequences as ungapped blocks (Lawr...

95 | Sampling and Bayes’ inference in scientific modelling and robustness - Box - 1980 |

78 | The Collapsed Gibbs Sampler in Bayesian Computations With Applications to a Gene Regulation Problem - Liu - 1994 |
Citation Context: ... multiple alignment have recently been developed: the block-motif model that describes conserved regions in protein or DNA sequences as ungapped blocks (Lawrence and Reilly 1990; Lawrence et al. 1993; Liu 1994; Liu et al. 1995; Neuwald, Liu and Lawrence 1995); and the hidden Markov model (HMM) that treats the observed sequences as though they were generated by a hypothetical ancestral model via mutation (B...

65 | Inching towards reality: An improved likelihood model of sequence evolution - Thorne, Kishino, et al. - 1992 |

59 | Extracting protein alignment models from the sequence database. Nucleic Acids Res - Neuwald, Liu, et al. - 1997 |
Citation Context: ... described in Section 4. This procedure, in combination with an improved alignment algorithm (described in Section 3), is a key feature of PROBE, a multiple alignment and database search methodology (Neuwald et al. 1997). 1.3 Modeling the Sequence Alignment. Current biopolymer sequences are believed to have arisen from a common ancestral DNA sequence through evolution. This evolutionary process consists of two types ...

57 | Protein modeling using hidden Markov models: Analysis of globins - Haussler, Krogh, et al. - 1992 |
Citation Context: ...s analysis, etc. (Rabiner 1989). The method was first applied to model biological sequences by Churchill (1989) and has become very popular recently in multiple sequence alignment (Baldi et al. 1994; Krogh et al. 1994; Lazareva and Churchill 1997). The basic form of an HMM can be written as $y_t \sim f_t(y \mid h_t)$ (2) and $h_t \sim g_t(h \mid h_{t-1})$ (3), where $f_t$ and $g_t$ are probability distributions (known up to some ...

53 | Automated construction and graphical presentation of protein blocks from unaligned sequences - Henikoff, Henikoff, et al. - 1995 |

30 | Maximum likelihood alignment of DNA sequences - Bishop, Thompson - 1986 |
Citation Context: ...of progeny. This process can be represented by an evolutionary tree which is rarely observable. Some methods that incorporate the evolutionary process to align pairs of sequences have been described (Bishop and Thompson 1986; Thorne, Kishino, and Felsenstein 1991, 1992; Allison, Wallace, and Yee 1992). More recently, Zhu, Liu and Lawrence (1997) proposed a Bayesian alignment procedure which produces the posterior distrib...

29 | Chance and statistical significance in protein and DNA sequence analysis - Karlin, Brendel - 1992 |

20 | The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimization of multiple alignments - Allison, Wallace - 1994 |

12 | Minimum message length encoding, evolutionary trees and multiple-alignment - Allison, Wallace, et al. - 1992 |

9 | An expectation maximization algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct - Lawrence, Reilly - 1990 |

7 | On the statistical significance of nucleic acid similarities - Lipman, Wilbur, et al. - 1984 |

4 | Effective large-scale sequence similarity searches - Claverie - 1996 |
Citation Context: ...r of multiple comparisons, the comparison scores obtained by chance from random sequences are creeping into the range of the comparison scores for truly related sequences (Henikoff and Henikoff 1991; Claverie 1996). Two statistical models for multiple alignment have recently been developed: the block-motif model that describes conserved regions in protein or DNA sequences as ungapped blocks (Lawrence and Reilly...

4 | Bayesian restoration of a hidden Markov chain with applications to DNA sequencing - Churchill, Lazareva - 1999 |

2 | Motifs and Structural Fold of the Cofactor Binding Site of Human Glutamate Decarboxylase - Martin, L, et al. - 1998 |

1 | Associated with Hereditary Non-polyposis Colon Cancer - Earabino, Lipford, et al. - 1994 |

1 | Extended Homology Prediction for Motif Structure by Multiple Sequence Alignment. Modeling and Scientific Computing - K, Lawrence - 1997 |