Results 1 - 10 of 33
A boosting approach for motif modeling using ChIP-chip data
- Bioinformatics
, 2005
"... doi:10.1093/bioinformatics/bti402 ..."
MotifPrototyper: a Bayesian Profile Model for Motif Families
- Proc. Natl Acad. Sci. USA
, 2004
Cited by 7 (0 self)
In this paper, we address the problem of modeling generic features of structurally but not textually related DNA motifs, that is, motifs whose consensus sequences are entirely different, but nevertheless share "meta-sequence features" reflecting similarities in the DNA binding domains of their associated protein recognizers. We present MotifPrototyper, a profile Bayesian model which can capture structural properties typical of particular families of motifs. Each family corresponds to transcription regulatory proteins with similar types of structural signatures in their DNA binding domains. We show how to train MotifPrototypers from biologically identified motifs categorized according to the TRANSFAC categorization of transcription factors, and present empirical results of motif classification, motif parameter estimation and de novo motif detection using the learned profile models.
Sequence features of DNA binding sites reveal structural class of associated transcription factor
- Bioinformatics
, 2006
Cited by 7 (1 self)
Motivation: A key goal in molecular biology is to understand the mechanisms by which a cell regulates the transcription of its genes. One important aspect of this transcriptional regulation is the binding of transcription factors (TFs) to their specific cis-regulatory counterparts on the DNA. TFs recognize and bind their DNA counterparts according to the structure of their DNA-binding domains (e.g. zinc finger, leucine zipper, homeodomain). The structure of these domains can be used as a basis for grouping TFs into classes. Although the structure of DNA-binding domains varies widely across TFs generally, the TFs within a particular class bind to DNA in a similar fashion, suggesting the existence of class-specific features in the DNA sequences bound by each class of TFs. Results: In this paper, we apply a sparse Bayesian learning algorithm to identify a small set of class-specific features in the DNA sequences bound by different classes of TFs; the algorithm simultaneously learns a true multi-class classifier that uses these features to predict the DNA-binding domain of the TF that recognizes a particular set of DNA sequences. We train our algorithm on the six largest classes in TRANSFAC, comprising a total of 587 TFs. We learn a six-class classifier for this training set that achieves 87% leave-one-out cross-validation accuracy. We also identify features within cis-regulatory sequences that are highly specific to each class of TF, which has significant implications for how TF binding sites should be modeled for the purpose of motif discovery.
A feature-based approach to modeling protein-DNA interactions
- In Proc. RECOMB’07
, 2007
Cited by 5 (0 self)
Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. In many cases this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF-DNA interactions, based on Markov networks. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our models, and devise an algorithm for learning their structural features from binding site data. We evaluate our approach on synthetic data, and then apply it to binding site and ChIP-chip data from yeast. We reveal sequence features that are present in the binding specificities of yeast TFs, and show that FMMs explain the binding data significantly better than PSSMs. Key words: transcription factor binding sites, DNA sequence motifs, probabilistic graphical models, Markov networks, motif finder.
Extracting sequence features to predict protein-DNA interactions: A comparative study
- Nucleic Acids Research
, 2008
Cited by 4 (3 self)
Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present here a systematic study of predictive modeling approaches to the TF-DNA binding problem, which have frequently been shown to be more efficient than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine a few state-of-the-art learning methods including stepwise linear regression, multivariate adaptive regression splines (MARS), neural networks, support vector machines, boosting, and Bayesian additive regression trees (BART). These methods are applied to both simulated datasets and two whole-genome ChIP-chip datasets on the TFs Oct4 and Sox2, respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve the predictive power and identify more biologically interesting features, such as TF-TF interactions, than the PWM approach. In particular, BART and boosting show the best and the most robust overall performance among all the methods.
Relevance Vector Machines for classifying points and regions in biological sequences.
, 2008
Cited by 2 (0 self)
The Relevance Vector Machine (RVM) is a recently developed machine learning framework capable of building simple models from large sets of candidate features. Here, we describe a protocol for using the RVM to explore very large numbers of candidate features, and a family of models which apply the power of the RVM to classifying and detecting interesting points and regions in biological sequence data. The models described here have been used successfully for predicting transcription start sites and other features in genome sequences.
VOMBAT: prediction of transcription factor binding sites using variable order Bayesian trees
- Nucl Acids Res
, 2006
Cited by 1 (1 self)
Variable order Markov models and variable order Bayesian trees have been proposed for the recognition of transcription factor binding sites, and it could be demonstrated that they outperform traditional models, such as position weight matrices, Markov models and Bayesian trees. We develop a web server for the recognition of DNA binding sites based on variable order Markov models and variable order Bayesian trees offering the following functionality: (i) given datasets with annotated binding sites and genomic background sequences, variable order Markov models and variable order Bayesian trees can be trained; (ii) given a set of trained models, putative DNA binding sites can be predicted in a given set of genomic sequences and (iii) given a dataset with annotated binding sites and a dataset with genomic background sequences, cross-validation experiments for different model combinations with different parameter settings can be performed. Several of the offered services are computationally demanding, such as genome-wide predictions of DNA binding sites in mammalian genomes or sets of 10 4-fold cross-validation experiments for different model combinations based on problem-specific data sets. In order to execute these jobs, and in order to serve multiple users at the same time, the web server is attached to a Linux cluster with 150 processors. VOMBAT is available at
Feature Based Representation and Detection of Transcription Factor Binding Sites
Cited by 1 (0 self)
The prediction of transcription factor binding sites is an important problem, since it reveals information about the transcriptional regulation of genes. A commonly used representation of these sites is the position specific weight matrix, which shows weak predictive power. We introduce a feature-based modelling approach, which is able to deal with various kinds of biological properties of binding sites and models them via Bayesian belief networks. The presented results imply higher model accuracy in contrast to the PSSM approach.
Finding Regulatory Motifs with Maximum Density Subgraphs
Cited by 1 (0 self)
The identification of over-represented but imperfectly conserved motifs in genomic DNA is a problem with important biological applications, such as the discovery of regulatory elements that determine the timing, location, and level of gene transcription. Experimental techniques such as ChIP-chip and gene expression
Efficient learning of Bayesian network classifiers: An
Cited by 1 (1 self)
We introduce a Bayesian network classifier less restrictive than Naive Bayes (NB) and Tree Augmented Naive Bayes (TAN) classifiers. Considering that learning an unrestricted network is unfeasible, the proposed classifier is confined to be consistent with the breadth-first search order of an optimal TAN. We propose an efficient algorithm to learn such classifiers for any score that decomposes over the network structure, including the well-known scores based on information theory and Bayesian scoring functions. We show that the induced classifier always scores better than or the same as the NB and TAN classifiers. Experiments on modeling transcription factor binding sites show that, in many cases, the improved scores translate into increased classification accuracy.

