Modeling Dependencies in Protein-DNA Binding Sites
, 2003
Cited by 54 (1 self)
The availability of whole genome sequences and high-throughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNA-binding proteins, such as transcription factors. A common representation of transcription factor binding sites is a position-specific score matrix (PSSM). This representation makes the strong assumption that binding site positions are independent of each other. In this work, we explore Bayesian network representations of binding sites that provide different tradeoffs between complexity (number of parameters) and the richness of dependencies between positions. We develop the formal machinery for learning such models from data and for estimating the statistical significance of putative binding sites. We then evaluate the ramifications of these richer representations in characterizing binding site motifs and predicting their genomic locations. We show that these richer representations improve over the PSSM model in both tasks.
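To make the independence assumption concrete, the following is a minimal sketch of PSSM scoring, not the authors' implementation. The function names `build_pssm` and `score`, the pseudocount of 1, and the uniform background are assumptions chosen for illustration. Because each position contributes its own term to the sum, the score cannot express dependencies between positions.

```python
import math


def build_pssm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Build per-position log2-odds scores from aligned, equal-length sites.

    Assumption: uniform background unless one is supplied.
    """
    if background is None:
        background = {b: 1.0 / len(alphabet) for b in alphabet}
    length = len(sites[0])
    pssm = []
    for i in range(length):
        # Count each base at position i, starting from the pseudocount.
        counts = {b: pseudocount for b in alphabet}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pssm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in alphabet}
        )
    return pssm


def score(pssm, seq):
    """Sum of independent per-position scores (the PSSM assumption)."""
    return sum(column[base] for column, base in zip(pssm, seq))
```

For example, `score(build_pssm(["ACGT", "ACGT", "ACGA"]), "ACGT")` is higher than the score of `"TTTT"` under the same matrix, since every position of `"ACGT"` matches the most frequent base.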
Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals
- J. Comput. Biol
, 2004
A boosting approach for motif modeling using ChIP-chip data
- Bioinformatics
, 2005
doi:10.1093/bioinformatics/bti402
A simple physical model for the prediction and design of protein–DNA interactions
- J. Mol. Biol
, 2004
Cited by 13 (5 self)
Protein–DNA interactions are crucial for many biological processes. Attempts to model these interactions have generally taken the form of amino acid–base recognition codes or purely sequence-based profile methods, which depend on the availability of extensive sequence and structural information for specific structural families, neglect side-chain conformational variability, and lack generality beyond the structural family used to train the model. Here, we take advantage of recent advances in rotamer-based protein design and the large number of structurally characterized protein–DNA complexes to develop and parameterize a simple physical model for protein–DNA interactions. The model shows considerable promise for redesigning amino acids at protein–DNA interfaces, as design calculations recover the amino acid residue identities and conformations at these interfaces with accuracies comparable to sequence recovery in globular proteins. The model shows promise also for predicting DNA-binding specificity for fixed protein sequences: native DNA sequences are selected correctly from pools of competing DNA substrates; however, incorporation of backbone movement will likely be required to improve performance in homology modeling applications. Interestingly, optimization of zinc finger protein amino acid sequences for high-affinity binding to specific DNA sequences results in proteins with little or no predicted specificity, suggesting that naturally occurring DNA-binding proteins are optimized for specificity rather than affinity. When combined with algorithms that optimize specificity directly, the simple computational model developed here should be useful for the engineering of proteins with novel DNA-binding specificities.
Efficient exact p-value computation for small sample, sparse, and surprising categorical data
- J. of Comp. Bio
, 2004
Cited by 8 (1 self)
A major obstacle in applying various hypothesis testing procedures to datasets in bioinformatics is the computation of ensuing p-values. In this paper, we define a generic branch-and-bound approach to efficient exact p-value computation and enumerate the required conditions for successful application. Explicit procedures are developed for the entire Cressie–Read family of statistics, which includes the widely used Pearson and likelihood ratio statistics in a one-way frequency table goodness-of-fit test. This new formulation constitutes a first practical exact improvement over the exhaustive enumeration performed by existing statistical software. The general techniques we develop to exploit the convexity of many statistics are also shown to carry over to contingency table tests, suggesting that they are readily extendible to other tests and test statistics of interest. Our empirical results demonstrate a speed-up of orders of magnitude over the exhaustive computation, significantly extending the practical range for performing exact tests. We also show that the relative speed-up gain increases as the null hypothesis becomes sparser, that computation precision increases with increase in speed-up, and that computation time is very moderately affected by the magnitude of the computed p-value. These qualities make our algorithm especially appealing in the regimes of small samples, sparse null distributions, and rare events, compared to the alternative asymptotic approximations and Monte Carlo samplers. We discuss several established bioinformatics applications, where small sample size, small expected counts in one or more categories (sparseness), and very small p-values do occur. Our computational framework could be applied in these, and similar cases, to improve performance. Key words: p-value, exact tests, branch and bound, real extension, categorical data.
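For orientation, the exhaustive enumeration that the abstract names as the baseline can be sketched as follows. This is not the paper's branch-and-bound algorithm: it visits every possible count vector, which is exactly the cost the paper aims to avoid. The function names and the numerical tolerance are assumptions for illustration, and the sketch assumes all null probabilities are strictly positive.

```python
from math import factorial, prod


def compositions(n, k):
    """Yield every k-tuple of non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def multinomial_pmf(counts, probs):
    """Probability of a count vector under the multinomial null."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # stays an exact integer at every step
    return coef * prod(p ** c for p, c in zip(probs, counts))


def pearson(counts, probs):
    """Pearson chi-square statistic for a one-way frequency table."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


def exact_p_value(observed, probs, tol=1e-12):
    """Sum the null probability of all tables at least as extreme as observed."""
    n = sum(observed)
    t_obs = pearson(observed, probs)
    return sum(
        multinomial_pmf(x, probs)
        for x in compositions(n, len(probs))
        if pearson(x, probs) >= t_obs - tol
    )
```

The number of tables enumerated grows combinatorially with the sample size and the number of categories, which is why pruning whole groups of tables at once, as a branch-and-bound scheme does, can pay off.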
Nucleosome Occupancy Information Improves de novo Motif Discovery
- RECOMB 2007. LNCS (LNBI
, 2007
Cited by 8 (2 self)
A complete understanding of transcriptional regulatory processes in the cell requires identification of transcription factor binding sites on a genome-wide scale. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known transcription factor binding sites occur in the genome than are actually functional. Chromatin structure is known to play an important role in guiding transcription factors to those sites that are functional. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling transcription factors to bind DNA in those regions [1]. In this paper, we describe a novel algorithm which employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy; the nucleosome occupancy information comes from a recently published computational model [2]. When a Gibbs sampling algorithm with our informative prior is applied to yeast sequence sets identified by ChIP-chip [3], the correct motif is found in 50% more cases than with an uninformative uniform prior. Moreover, if nucleosome occupancy information is not available, our informative prior reduces to a new kind of prior that can exploit discriminative information in a purely generative setting.
The MAPPER database: a multi-genome catalog of putative transcription factor binding sites
- Nucleic Acids Res
, 2005
Cited by 7 (2 self)
We describe a comprehensive map of putative transcription factor binding sites (TFBSs) across multiple genomes created using a search method that relies on hidden Markov models built from experimentally determined TFBSs. Using the information in the TRANSFAC and JASPAR databases, we built 1134 models for TFBSs and used them to scan regions 10 kb upstream of the start of the transcript for all known genes in the human, mouse and Drosophila melanogaster genomes. The results, together with homology information on clusters of ortholog genes across the three genomes, were used to create a multiorganism catalog of annotated TFBSs. The catalog can be queried through a web interface accessible at
Sequence features of DNA binding sites reveal structural class of associated transcription factor
- Bioinformatics
, 2006
Cited by 7 (1 self)
Motivation: A key goal in molecular biology is to understand the mechanisms by which a cell regulates the transcription of its genes. One important aspect of this transcriptional regulation is the binding of transcription factors (TFs) to their specific cis-regulatory counterparts on the DNA. TFs recognize and bind their DNA counterparts according to the structure of their DNA-binding domains (e.g. zinc finger, leucine zipper, homeodomain). The structure of these domains can be used as a basis for grouping TFs into classes. Although the structure of DNA-binding domains varies widely across TFs generally, the TFs within a particular class bind to DNA in a similar fashion, suggesting the existence of class-specific features in the DNA sequences bound by each class of TFs. Results: In this paper, we apply a sparse Bayesian learning algorithm to identify a small set of class-specific features in the DNA sequences bound by different classes of TFs; the algorithm simultaneously learns a true multi-class classifier that uses these features to predict the DNA-binding domain of the TF that recognizes a particular set of DNA sequences. We train our algorithm on the six largest classes in TRANSFAC, comprising a total of 587 TFs. We learn a six-class classifier for this training set that achieves 87% leave-one-out cross-validation accuracy. We also identify features within cis-regulatory sequences that are highly specific to each class of TF, which has significant implications for how TF binding sites should be modeled for the purpose of motif discovery.
A combined model and a varied Gibbs sampling algorithm used for motif discovery
- In ACM International Conference Proceeding Series; Proceedings of the second conference on Asia-Pacific bioinformatics
, 2004
Cited by 4 (0 self)
Conserved sequences in gene regulatory regions play a dominant role in gene regulation. Discovering these sequences and their functions is important in the post-genome era. A novel model is constructed to represent conserved motifs in DNA sequences. This model combines the position weight matrix (PWM) and weight array model (WAM). Its advantage is that it captures not only the individual base frequencies in the motifs but also the relationships between neighbouring bases. In addition, a varied Gibbs sampling algorithm is applied that takes into account the different numbers of motif occurrences in each sequence. This variation is more consistent with the actual mechanism of gene transcription control. The model and the discovery algorithm are combined into a program. After analysing a set of DNA sequences from the upstream regions of genes with this program, putative motifs are discovered and compared to experimentally verified regulatory sequences. The results show that this combination is ideal for motif discovery and that the practice is meaningful for gene regulation research.
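The abstract does not say how the PWM and WAM components are combined, so the sketch below illustrates only the WAM half: a first-order model in which each base is conditioned on its left neighbour. The function names `train_wam` and `wam_log_prob` and the pseudocount of 1 are assumptions for illustration, not taken from the paper.

```python
import math


def train_wam(sites, alphabet="ACGT", pseudocount=1.0):
    """Estimate a first-order weight array model from aligned sites.

    Returns (first, trans): the base distribution at position 0 and, for each
    later position i, the conditional distribution P(x_i | x_{i-1}).
    """
    length = len(sites[0])
    first_counts = {b: pseudocount for b in alphabet}
    for site in sites:
        first_counts[site[0]] += 1
    total = sum(first_counts.values())
    first = {b: c / total for b, c in first_counts.items()}

    trans = []
    for i in range(1, length):
        counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        trans.append(
            {
                a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
                for a in alphabet
            }
        )
    return first, trans


def wam_log_prob(model, seq):
    """Natural-log probability of seq under the first-order model."""
    first, trans = model
    log_p = math.log(first[seq[0]])
    for i in range(1, len(seq)):
        log_p += math.log(trans[i - 1][seq[i - 1]][seq[i]])
    return log_p
```

Trained on `["ACGT", "ACGT", "TGCA", "TGCA"]`, this model prefers `"ACGT"` over `"ACCA"`: the adjacent pair C followed by C at positions 2 and 3 never occurs in the training sites, so its conditional probability rests on the pseudocount alone.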

