PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny
 PLoS Comput Biol 2005
Cited by 83 (5 self)
A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and "background" intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding
Additivity in protein–DNA interactions: how good an approximation is it?
 Nucleic Acids Res
, 2002
Cited by 79 (11 self)
Man and Stormo and Bulyk et al. recently presented their results on the study of the DNA binding affinity of proteins. In both of these studies the main conclusion is that the additivity assumption, usually applied in methods to search for binding sites, is not true. In the first study, the analysis of binding affinity data from the Mnt repressor protein bound to all possible DNA (sub)targets at positions 16 and 17 of the binding site showed that those positions are not independent. In the second study, the authors analysed DNA binding affinity data of the wild-type mouse EGR1 protein and four variants differing on the middle finger. The binding affinity of these proteins was measured to all 64 possible trinucleotide (sub)targets of the middle finger using microarray technology. The analysis of the measurements also showed interdependence among the positions in the DNA target. In the present report, we review the data of both studies and we reanalyse them using various statistical methods, including a comparison with a multiple regression approach. We conclude that despite the fact that the additivity assumption does not fit the data perfectly, in most cases it provides a very good approximation of the true nature of the specific protein–DNA interactions. Therefore, additive models can be very useful for the discovery and prediction of binding sites in genomic DNA.
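To make the additivity assumption concrete, here is a minimal Python sketch of an additive (weight-matrix) binding model: each position contributes an independent log-odds term, and the site score is their sum. The example sites, the uniform background, and the pseudocount are illustrative assumptions, not data or parameters from the paper.

```python
import math

def log_odds_matrix(sites, background=0.25, pseudocount=1.0):
    """Build a per-position log-odds matrix from aligned example sites."""
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def additive_score(matrix, site):
    """Additive model: sum of independent per-position contributions."""
    return sum(column[base] for column, base in zip(matrix, site))

# Made-up aligned sites, for illustration only.
example_sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = log_odds_matrix(example_sites)
consensus_score = additive_score(pwm, "ACGT")
mismatch_score = additive_score(pwm, "GGGG")
```

Under this model, changing the base at one position shifts the score by the same amount regardless of the bases at other positions. That independence is what the Mnt and EGR1 measurements call into question.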
An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem
 Proc. Seventh Int'l Conf. Intelligent Systems for Molecular Biology
, 1999
Cited by 62 (3 self)
This is an investigation of methods for finding short motifs that only occur in a fraction of the input sequences. Unlike local search techniques that may not reach a global optimum, the method proposed here is guaranteed to produce the motifs with greatest z-scores. This method is illustrated for the Ribosome Binding Site Problem, which is to identify the short mRNA 5′ untranslated sequence that is recognized by the ribosome during initiation of protein synthesis. Experiments were performed to solve this problem for each of fourteen sequenced prokaryotes, by applying the method to the full complement of genes from each. One of the interesting results of this experimentation is evidence that the recognized sequence of the thermophilic archaea A. fulgidus, M. jannaschii, M. thermoautotrophicum, and P. horikoshii may be somewhat different than the well-known Shine-Dalgarno sequence.
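As a rough illustration of ranking motifs by z-score, the Python sketch below compares the observed count of a k-mer with its expected count under an independent, uniform background. This is a simplified stand-in, not the paper's statistic: the binomial variance used here ignores dependence between overlapping windows, and the function name and background model are assumptions.

```python
import math

def kmer_zscore(kmer, sequences, background=None):
    """z-score of a k-mer's occurrence count under an i.i.d. background (simplified)."""
    background = background or {b: 0.25 for b in "ACGT"}
    k = len(kmer)
    # Probability that one random window equals the k-mer.
    p = math.prod(background[b] for b in kmer)
    windows = sum(max(len(s) - k + 1, 0) for s in sequences)
    observed = sum(
        1 for s in sequences for i in range(len(s) - k + 1) if s[i:i + k] == kmer
    )
    expected = windows * p
    std = math.sqrt(windows * p * (1 - p))  # binomial approximation (assumption)
    return (observed - expected) / std

# Made-up sequences, for illustration only.
toy_sequences = ["AGGAGG", "TTAGGA"]
z_enriched = kmer_zscore("AGG", toy_sequences)
z_absent = kmer_zscore("CCC", toy_sequences)
```

An exhaustive method would evaluate every candidate motif this way and report the highest-scoring ones, which is feasible only because the motifs are short.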
Finding Similar Regions In Many Strings
 Journal of Computer and System Sciences
, 1999
Cited by 53 (8 self)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s_1, ..., s_n. The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each s_i, and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem allowing at most a constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d to each s_i. The problem is NP-hard [7, 18]. [3] gives a polynomial time algorithm for constant d. For superlogarithmic d, [2, 9] give efficient approximation algorithms using linear program relaxation techniques. The best polynomial time approximation has ratio 4/3 for all d, given by [18] ([9] also independently claimed the 4/3 ratio but only for superlogarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation with ratio 2 - 2/(2|Σ| + 1) for the more elusive Closest
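The two problem definitions above can be pinned down with an exhaustive Python sketch. It enumerates every candidate string over the alphabet, so it runs in exponential time and is only a statement of the objectives, not the PTAS from the paper; the function names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_window_distance(candidate, sequence):
    """Smallest Hamming distance from candidate to any window of the sequence."""
    L = len(candidate)
    return min(
        hamming(candidate, sequence[i:i + L])
        for i in range(len(sequence) - L + 1)
    )

def consensus_pattern(sequences, L, alphabet="ACGT"):
    """Median string of length L minimizing the total distance to one region per sequence."""
    best = None
    for letters in product(alphabet, repeat=L):
        candidate = "".join(letters)
        total = sum(best_window_distance(candidate, s) for s in sequences)
        if best is None or total < best[1]:
            best = (candidate, total)
    return best

def closest_string(strings, alphabet="ACGT"):
    """String minimizing the maximum Hamming distance to equal-length input strings."""
    L = len(strings[0])
    best = None
    for letters in product(alphabet, repeat=L):
        candidate = "".join(letters)
        radius = max(hamming(candidate, s) for s in strings)
        if best is None or radius < best[1]:
            best = (candidate, radius)
    return best

# Made-up inputs, for illustration only.
pattern_result = consensus_pattern(["TTACGTT", "GACGAGG", "CCACGC"], 3)
closest_result = closest_string(["AAT", "ACA", "GAA"])
```

Consensus Patterns minimizes a sum of distances, while Closest String minimizes the maximum distance. That difference is why the two problems call for separate approximation arguments.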
A boosting approach for motif modeling using ChIP-chip data
 Bioinformatics
, 2005
Cited by 32 (7 self)
Motivation: Building an accurate binding model for a transcription factor (TF) is essential to differentiate its true binding targets from those spurious ones. This is an important step towards the understanding of gene regulation. Results: This paper describes a boosting approach for modeling TF-DNA binding. Different from the widely used weight matrix model, which predicts TF-DNA binding based on a linear combination of position-specific contributions, our approach builds a TF binding classifier by combining a set of weight-matrix-based classifiers, thus yielding a nonlinear binding decision rule. The proposed approach is applied to the ChIP-chip data of Saccharomyces cerevisiae. When compared to the weight matrix method, our new approach shows significant improvements on the specificity in a majority of cases. Contact:
Finding Subtle Motifs by Branching from Sample Strings
, 2003
Cited by 31 (0 self)
Many motif finding algorithms apply local search techniques to a set of seeds. For example, GibbsDNA (Lawrence et al., 1993) applies Gibbs sampling to random seeds, and MEME (Bailey and Elkan, 1994) applies the EM algorithm to selected sample strings, i.e. substrings of the sample. In the case of subtle motifs, recent benchmarking efforts show that both random seeds and selected sample strings may never get close to the globally optimal motif. We propose a new approach which searches motif space by branching from sample strings, and implement this idea in both pattern-based and profile-based settings. Our PatternBranching and ProfileBranching algorithms achieve favorable results relative to other motif finding algorithms.
Similarity of position frequency matrices for transcription factor binding sites
 Bioinformatics
, 2005
Cited by 29 (5 self)
Motivation: Transcription-factor binding sites in promoter sequences of higher eukaryotes are commonly modeled using position frequency matrices. The ability to compare position frequency matrices representing binding sites is especially important for de novo sequence motif discovery, where it is desirable to compare putative matrices to one another and to known matrices. Results: We describe a position frequency matrix similarity quantification method based on product-multinomial distributions, demonstrate its ability to identify position frequency matrix similarity and show that it has a better false positive to false negative ratio compared to existing methods. We group transcription factor binding site frequency matrices from two libraries into matrix families, and identify the matrices that are common and unique to these libraries. We identify similarities and differences between the skeletal-muscle-specific and non-muscle-specific frequency matrices for the binding sites of Mef2, Myf, Sp1, SRF and TEF of Wasserman and Fickett (1998). We further identify known frequency matrices and matrix families that are strongly similar to the matrices given by Wasserman and Fickett. We provide methodology and tools to compare and query libraries of frequency matrices for transcription factor binding sites. Availability: Software is available to use over the web at
Equi-Energy Sampler with Applications in Statistical Inference and Statistical Mechanics
, 2006
Cited by 26 (4 self)
We introduce a new sampling algorithm, the equi-energy sampler, for efficient statistical sampling and estimation. Complementary to the widely used temperature-domain methods, the equi-energy sampler, utilizing the temperature–energy duality, targets the energy directly. The focus on the energy function not only facilitates efficient sampling, but also provides a powerful means for statistical estimation, for example, the calculation of the density of states and microcanonical averages in statistical mechanics. The equi-energy sampler is applied to a variety of problems, including exponential regression in statistics, motif sampling in computational biology and protein folding in biophysics.
Computational identification of transcriptional regulatory elements in DNA sequence
, 2006
Cited by 24 (0 self)
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
Fast Probabilistic Analysis of Sequence Function Using Scoring Matrices
, 2000
Cited by 16 (4 self)
Motivation: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired tradeoff between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.
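One standard way to attach a p value to a segmental score is to compute the exact score distribution of a random segment by dynamic programming over the matrix columns. The Python sketch below assumes integer scores and an independent background; it illustrates the general idea only and is not claimed to be the authors' quantile-function algorithm. The toy matrix and function names are hypothetical.

```python
from collections import defaultdict

def score_distribution(matrix, background):
    """Exact distribution of the total score of a random segment."""
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        # Extend every partial score by each possible base at this position.
        for score, prob in dist.items():
            for base, s in column.items():
                nxt[score + s] += prob * background[base]
        dist = dict(nxt)
    return dist

def p_value(dist, threshold):
    """Probability that a random segment scores at least the threshold."""
    return sum(p for s, p in dist.items() if s >= threshold)

# Toy two-column integer scoring matrix, for illustration only.
toy_matrix = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
]
uniform = {b: 0.25 for b in "ACGT"}
dist = score_distribution(toy_matrix, uniform)
p_best = p_value(dist, 4)
p_one_match = p_value(dist, 1)
```

With the distribution in hand, a user-chosen p threshold maps to a score cutoff, so segments that cannot reach the cutoff can be skipped during a scan.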