Results 1-10 of 32
Predicting protein structure using hidden Markov models
, 1997
Cited by 58 (20 self)
We discuss how methods based on hidden Markov models performed in the fold recognition section of the CASP2 experiment. Hidden Markov models were built for a set of about a thousand structures from the PDB database, and each CASP2 target sequence was scored against this library of hidden Markov models. In addition, a hidden Markov model was built for each of the target sequences, and all of the sequences in PDB were scored against that target model. Having high scores from both methods was found to be highly indicative of the target and a structure being homologous. Predictions were made based on several criteria: the scores with the structure models, the scores with the target models, consistency between the secondary structure in the known structure and predictions for the target (using the program PhD), human examination of predicted alignments between target and structure (using RASMOL), and solvation preferences in the alignment of the target and structure. The method worked well in comparison to other methods used at CASP2 for targets of moderate difficulty, where the closest structure in PDB could be aligned to the target with at least 15% residue identity. There was no evidence for the method's effectiveness for harder cases, where the residue identity was much lower than 15%.
Scoring Hidden Markov Models
Cited by 37 (5 self)
Motivation: Statistical sequence comparison techniques, such as hidden Markov models and generalized profiles, calculate the probability that a sequence was generated by a given model. Log-odds scoring is a means of evaluating this probability by comparing it to a null hypothesis, usually a simpler statistical model intended to represent the universe of sequences as a whole, rather than the group of interest. Such scoring leads to two immediate questions: what should the null model be, and what threshold of log-odds score should be deemed a match to the model? Results: This paper experimentally analyses these two issues. Within the context of the Sequence Alignment and Modeling software suite (SAM), we consider a variety of null models and suitable thresholds. Additionally, we consider HMMer's log-odds scoring and SAM's original Z-scoring method. Among the null model choices, a simple looping null model that emits characters according to the geometric mean of the character probabilities in the columns modeled by the HMM performs well or best across all four discrimination experiments.
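The geometric-mean null model described in this abstract can be illustrated with a small sketch. It is not SAM's or HMMer's implementation: it assumes an ungapped model in which each sequence position is scored against exactly one match column, it ignores transition probabilities and the length term of the looping null model, and the function names are hypothetical.

```python
import math

def geometric_mean_null(columns):
    """Null distribution: normalized geometric mean of the per-column
    character probabilities (assumes every probability is > 0)."""
    alphabet = list(columns[0].keys())
    n = len(columns)
    raw = {
        a: math.exp(sum(math.log(col[a]) for col in columns) / n)
        for a in alphabet
    }
    total = sum(raw.values())
    return {a: v / total for a, v in raw.items()}

def log_odds_bits(seq, columns, null):
    """Log-odds score in bits of an ungapped sequence against the columns.
    Simplifying assumption: one character per column, no insertions or deletions."""
    if len(seq) != len(columns):
        raise ValueError("sequence and model must have the same length")
    return sum(math.log2(col[c] / null[c]) for c, col in zip(seq, columns))

# Toy two-column DNA model (illustrative numbers, not from the paper).
columns = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
null = geometric_mean_null(columns)
```

With this toy model, a sequence matching the preferred characters ("AC") scores above zero and a sequence of disfavoured characters ("GT") scores below zero. Where to set the match threshold is the second question the abstract raises, and the sketch does not answer it.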
A flexible motif search technique based on generalized profiles
 COMPUTERS AND CHEMISTRY
, 1996
Cited by 35 (7 self)
... generalized profile syntax serving as a motif definition language; and (2) a motif search method specifically adapted to the problem of finding multiple instances of a motif in the same sequence. The new profile structure, which is the core of the generalized profile syntax, combines the functions of a variety of motif descriptors implemented in other methods, including regular-expression-like patterns, weight matrices, previously used profiles, and certain types of hidden Markov models (HMMs). The relationship between generalized profiles and other biomolecular motif descriptors is analyzed in detail, with special attention to HMMs. Generalized profiles are shown to be equivalent to a particular class of HMMs, and conversion procedures in both directions are given. The conversion procedures provide an interpretation for local alignment in the framework of stochastic models, allowing for clear, simple significance tests. A mathematical statement of the motif search problem defines the new method exactly without linking it to a specific algorithmic solution. Part of the definition includes a new definition of disjointness of alignments.
SR: A probabilistic model of local sequence alignment that simplifies statistical significance estimation
 PLoS Comput Biol
Cited by 29 (5 self)
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant λ = log 2, and the high-scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
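As a rough illustration of how a fixed λ = log 2 turns bit scores into E-values, the sketch below uses the standard Gumbel and exponential tail formulas. The location parameters `mu` and `tau`, the database size `n`, and all function names are assumptions made for this example. The abstract fixes only λ, so the locations would still have to be estimated for each model.

```python
import math

LAMBDA = math.log(2)  # slope conjectured in the abstract for bit scores

def viterbi_pvalue(bits, mu):
    """Gumbel tail P(S > bits) = 1 - exp(-exp(-lambda * (bits - mu))).
    `mu` is an assumed, separately fitted location parameter."""
    return -math.expm1(-math.exp(-LAMBDA * (bits - mu)))

def forward_tail_pvalue(bits, tau):
    """Exponential tail P(S > bits) = exp(-lambda * (bits - tau)), capped at 1.
    Only meant for the high-scoring tail; `tau` is an assumed location."""
    return min(1.0, math.exp(-LAMBDA * (bits - tau)))

def evalue(pvalue, n):
    """Expected number of hits at least this good among n comparisons."""
    return pvalue * n
```

Under these formulas, each additional bit of score roughly halves the P-value in the high-scoring tail, which is what makes a constant λ convenient.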
The emergence of pattern discovery techniques in computational biology
 Metabolic Engineering
, 2000
Cited by 28 (4 self)
In the past few years, pattern discovery has been emerging as a generic tool of choice for tackling problems from the computational biology domain. In this presentation, after defining the problem in its generality, we review some of the algorithms that have appeared in the literature and describe several applications of pattern discovery to problems from computational biology. © 2000 Academic Press.
Kay: A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System
 Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology
, 1996
Cited by 22 (0 self)
We present a probabilistic interpretation of local sequence alignment methods where the alignment scoring system (ASS) plays the role of a stochastic process defining a probability distribution over all sequence pairs. An explicit algorithm is given to compute the probability of two sequences given an ASS. Based on this definition, a modified version of the Smith-Waterman local similarity search algorithm has been devised, which assesses sequence relationships by log likelihood ratios. When tested on classical examples such as globins or G-protein-coupled receptors, the new method proved to be up to an order of magnitude more sensitive than the native Smith-Waterman algorithm.
Detection of significant patterns by compression algorithms: the case of Approximate Tandem Repeats in DNA sequences.
, 1997
Cited by 21 (3 self)
We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property. The more a sequence is compressed by the algorithm, the more significant the property is for that sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence DNA): approximate tandem repeats of small motifs (i.e. of lengths < 4). This algorithm has been tested on four yeast chromosomes. The presence of approximate tandem repeats seems to be a uniform structural property of yeast chromosomes. The algorithms, written in C, are available on the World Wide Web (URL: http://www.lifl.fr/ rivals/Doc/RTA/ ).
Weighting Hidden Markov Models For Maximum Discrimination
 Bioinformatics
, 1998
Cited by 20 (3 self)
Motivation: Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea. Results: One good method scales sequence weights by an exponential that ranges between 0.1 for the best scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction). Availability: A World Wide Web server, as well as information on obtaining the Sequence Alignme...
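The abstract gives only the endpoints of the weighting scheme (0.1 for the best-scoring sequence, 1.0 for the worst), not its functional form. The sketch below is therefore one possible reading, not the SAM method: it assumes that a higher score means a better fit to the model and interpolates exponentially between the two endpoints. The function name and the linear rescaling of scores are hypothetical.

```python
import math

def discrimination_weights(scores, best=0.1, worst=1.0):
    """Map scores to weights by an exponential running from `worst`
    (lowest score) to `best` (highest score).
    Assumption: higher score = better fit to the current model."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All sequences score alike, so there is nothing to discriminate.
        return [worst] * len(scores)
    k = math.log(best / worst)  # negative exponent rate
    return [worst * math.exp(k * (s - lo) / (hi - lo)) for s in scores]

weights = discrimination_weights([10.0, 20.0, 30.0])
```

Down-weighting sequences the model already scores well shifts training effort toward poorly represented subgroups, which is the bias problem the abstract describes.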
The Repeat Pattern Toolkit (RPT): Analyzing the Structure and Evolution of the C. elegans Genome
, 1994
Cited by 19 (3 self)
Over 3.6 million bases of DNA sequence from chromosome III of C. elegans have been determined. The availability of this extended region of contiguous sequence has allowed us to analyze the nature and prevalence of repetitive sequences in the genome of a eukaryotic organism with a high gene density. We have assembled a Repeat Pattern Toolkit (RPT) to analyze the patterns of repeats occurring in DNA. The tools include identifying significant local alignments (utilizing both two-way and three-way alignments), dividing the set of alignments into connected components (signifying repeat families), computing evolutionary distance between repeat family members, constructing minimum spanning trees from the connected components, and visualizing the evolution of the repeat families. Over 7000 families of repetitive sequences were identified. The size of the families ranged from isolated pairs to over 1600 segments of similar sequence. Approximately 12.3% of the analyzed sequence participates in a repeat element.
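The step of dividing alignments into connected components can be sketched with a union-find structure. This is a generic illustration, not the RPT code: it assumes each significant local alignment is reduced to a pair of segment identifiers, and the function name and identifiers are hypothetical.

```python
def repeat_families(aligned_pairs):
    """Group segments into families: two segments share a family if a
    chain of pairwise alignments connects them (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in aligned_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    families = {}
    for segment in list(parent):
        families.setdefault(find(segment), set()).add(segment)
    return sorted(sorted(members) for members in families.values())

families = repeat_families([("s1", "s2"), ("s2", "s3"), ("s4", "s5")])
```

Here `s1`, `s2`, and `s3` form one family even though `s1` and `s3` were never aligned directly, while `s4` and `s5` form an isolated pair. Building a minimum spanning tree within each family would additionally need the pairwise evolutionary distances, which are not modeled here.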
Sequence Alignment with Tandem Duplication
 J. Comp. Biol
, 1997
Cited by 19 (2 self)
Algorithm development for comparing and aligning biological sequences has, until recently, been based on the SI model of mutational events, which assumes that modification of sequences proceeds through any of the operations of substitution, insertion or deletion (the latter two collectively termed indels).