Finding Similar Regions In Many Strings
- Journal of Computer and System Sciences
, 1999
Cited by 45 (6 self)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s_1, ..., s_n. The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each s_i, and a median string s of length L, so that the total Hamming distance from s to these regions is minimized. We show the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem allowing at most a constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d of each s_i. The problem is NP-hard [7, 18]. [3] gives a polynomial time algorithm for constant d. For super-logarithmic d, [2, 9] give efficient approximation algorithms using linear program relaxation techniques. The best polynomial time approximation has ratio 4/3 for all d, given by [18] ([9] also independently claimed the 4/3 ratio but only for super-logarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation with ratio 2 - 2/(2|Σ|+1) for the more elusive Closest
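To make the two objectives in this abstract concrete, here is a minimal Python sketch of the Consensus Patterns cost and the Closest String radius. It is not the paper's PTAS: the exhaustive search below is exponential in L and only suitable for toy inputs. All function names are hypothetical, and the sketch assumes every sequence has length at least L.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def consensus_cost(median, seqs):
    """Total Hamming distance from `median` to its best window in each sequence."""
    L = len(median)
    return sum(
        min(hamming(median, s[i:i + L]) for i in range(len(s) - L + 1))
        for s in seqs
    )


def closest_string_radius(center, seqs):
    """Largest Hamming distance from `center` to any of the equal-length strings."""
    return max(hamming(center, s) for s in seqs)


def brute_force_consensus(seqs, L, alphabet="ACGT"):
    """Try all |alphabet|**L medians; ties are broken alphabetically.

    Illustration only: this is exhaustive search, not an approximation scheme.
    """
    candidates = ("".join(p) for p in product(alphabet, repeat=L))
    return min(candidates, key=lambda m: (consensus_cost(m, seqs), m))
```

For example, with the sequences ACGTAC, TTACGT and GACGAA and L = 3, the pattern ACG occurs exactly in all three, so its consensus cost is 0.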
Finding Subtle Motifs by Branching from Sample Strings
, 2003
Cited by 27 (0 self)
Many motif finding algorithms apply local search techniques to a set of seeds. For example, GibbsDNA (Lawrence et al., 1993) applies Gibbs sampling to random seeds, and MEME (Bailey and Elkan, 1994) applies the EM algorithm to selected sample strings, i.e. substrings of the sample. In the case of subtle motifs, recent benchmarking efforts show that both random seeds and selected sample strings may never get close to the globally optimal motif. We propose a new approach which searches motif space by branching from sample strings, and implement this idea in both pattern-based and profile-based settings. Our PatternBranching and ProfileBranching algorithms achieve favorable results relative to other motif finding algorithms.
EQUI-ENERGY SAMPLER WITH APPLICATIONS IN STATISTICAL INFERENCE AND STATISTICAL MECHANICS
, 2006
Cited by 18 (3 self)
We introduce a new sampling algorithm, the equi-energy sampler, for efficient statistical sampling and estimation. Complementary to the widely used temperature-domain methods, the equi-energy sampler, utilizing the temperature–energy duality, targets the energy directly. The focus on the energy function not only facilitates efficient sampling, but also provides a powerful means for statistical estimation, for example, the calculation of the density of states and microcanonical averages in statistical mechanics. The equi-energy sampler is applied to a variety of problems, including exponential regression in statistics, motif sampling in computational biology and protein folding in biophysics.
A boosting approach for motif modeling using ChIP-chip data
- Bioinformatics
, 2005
"... doi:10.1093/bioinformatics/bti402 ..."
Fast Probabilistic Analysis of Sequence Function Using Scoring Matrices
, 2000
Cited by 13 (4 self)
Motivation: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired tradeoff between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects. Results: We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they wi...
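The probability-filtering step described above (turning a p threshold into a score threshold) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: scores are integers, positions are independent, and letters are drawn i.i.d. from a background distribution. The function names and the matrix representation (one dict per position) are hypothetical.

```python
from collections import defaultdict


def score_distribution(matrix, background):
    """Exact distribution of the total score of a random segment.

    matrix: list of dicts, one per position, mapping letter -> integer score.
    background: dict mapping letter -> probability (assumed i.i.d.).
    """
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for letter, score in column.items():
                nxt[total + score] += prob * background[letter]
        dist = dict(nxt)
    return dist


def score_threshold(matrix, background, p):
    """Smallest score t with P(score >= t) <= p, or None if no score qualifies."""
    dist = score_distribution(matrix, background)
    tail = 0.0
    threshold = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]  # tail is now P(score >= this score)
        if tail > p:
            break
        threshold = score
    return threshold
```

Segments scoring below the returned threshold would then be discarded during the search. The number of distinct totals can grow with matrix length and score range, so a production version would likely bound or bin the scores.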
The evolution of DNA regulatory regions for proteo-gamma bacteria by interspecies comparisons
- Genome Res
, 2002
Fast multiple alignment of ungapped DNA sequences using information theory and a relaxation method
Cited by 11 (4 self)
An information theory based multiple alignment ("Malign") method was used to align the DNA binding sequences of the OxyR and Fis proteins, whose sequence conservation is so spread out that it is difficult to identify the sites. In the algorithm described here, the information content of the sequences is used as a unique global criterion for the quality of the alignment. The algorithm uses look-up tables to avoid recalculating computationally expensive functions such as the logarithm. Because there are no arbitrary constants and because the results are reported in absolute units (bits), the best alignment can be chosen without ambiguity. Starting from randomly selected alignments, a hill-climbing algorithm can track through the immense space of s^n combinations where s is the number of sequences and n is the number of positions possible for each sequence. Instead of producing a single alignment, the algorithm is fast enough that one can afford to use many start points and to classify ...
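As a rough illustration of using information content (in bits) as the alignment criterion, the sketch below scores an ungapped DNA alignment defined by one offset per sequence. It is not the Malign code: it assumes a four-letter alphabet (maximum 2 bits per column), omits any small-sample correction, and uses a hypothetical function name. The `clog` table mirrors the abstract's idea of look-up tables that avoid recomputing logarithms.

```python
import math


def alignment_information(seqs, offsets, width):
    """Total information (bits) of the ungapped alignment given per-sequence offsets."""
    n = len(seqs)
    # Look-up table of c * log2(c) for every possible count c (0 maps to 0).
    clog = [0.0] + [c * math.log2(c) for c in range(1, n + 1)]
    total = 0.0
    for col in range(width):
        counts = {}
        for s, off in zip(seqs, offsets):
            base = s[off + col]
            counts[base] = counts.get(base, 0) + 1
        # Column entropy: H = log2(n) - (1/n) * sum(c * log2(c)).
        entropy = math.log2(n) - sum(clog[c] for c in counts.values()) / n
        total += 2.0 - entropy
    return total
```

A hill-climbing search in this spirit would repeatedly shift one sequence's offset and keep the change whenever the total rises, restarting from many random offset vectors.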
A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length
, 2005
Learning Local Languages and Its Application to Protein alpha-Chain Identification
- PROC. OF 27TH HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES
, 1996
Cited by 8 (3 self)
This paper concerns an efficient algorithm for learning in the limit a special type of regular languages called locally testable languages from positive data, and its application to identifying the protein α-chain region in amino acid sequences. First, we present a linear time algorithm that, given a locally testable language, learns (identifies) its deterministic finite state automaton in the limit from only positive data. This provides us with a practical and efficient learning method for a specific domain of symbolic analysis. We then describe several experimental results using the learning algorithm developed above. Following a theoretical observation which strongly suggests that a certain type of amino acid sequences can be expressed by a locally testable language, we apply the learning algorithm to identifying the protein α-chain region in amino acid sequences for hemoglobin. Experimental scores show an overall success rate of 95% correct identification for positive data, an...
Improved spliced alignment from an information theoretic approach
- Bioinformatics
, 2006
Cited by 8 (0 self)

