## Modeling Dependencies in Protein-DNA Binding Sites (2003)

Citations: 78 (2 self)

### BibTeX

@INPROCEEDINGS{Barash03modelingdependencies,

author = {Yoseph Barash and Gal Elidan and Nir Friedman and Tommy Kaplan},

title = {Modeling Dependencies in Protein-DNA Binding Sites},

booktitle = {},

year = {2003},

pages = {28--37},

publisher = {ACM Press}

}

### Abstract

The availability of whole genome sequences and high-throughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNA-binding proteins, such as transcription factors. A common representation of transcription factor binding sites is a position-specific score matrix (PSSM). This representation makes the strong assumption that binding site positions are independent of each other. In this work, we explore Bayesian network representations of binding sites that provide different tradeoffs between complexity (number of parameters) and the richness of dependencies between positions. We develop the formal machinery for learning such models from data and for estimating the statistical significance of putative binding sites. We then evaluate the ramifications of these richer representations in characterizing binding site motifs and predicting their genomic locations. We show that these richer representations improve over the PSSM model in both tasks.
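To make the independence assumption concrete, the following is a minimal Python sketch of a PSSM, not the authors' implementation. It estimates one nucleotide distribution per position from aligned sites and scores a sequence as a sum of per-position log-odds. The function names `learn_pssm` and `log_odds`, the toy sites, and the uniform background are assumptions made for illustration; the default pseudo-count weight of 5 mirrors the smoothing value the paper reports using.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
UNIFORM = {b: 0.25 for b in ALPHABET}  # assumed background distribution


def learn_pssm(sites, background=UNIFORM, pseudo=5.0):
    """Estimate a per-position nucleotide distribution from aligned sites.

    `pseudo` pseudo-instances, spread according to `background`, are added
    to every column so that unseen nucleotides keep non-zero probability.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudo
    pssm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        pssm.append({b: (counts[b] + pseudo * background[b]) / total
                     for b in ALPHABET})
    return pssm


def log_odds(pssm, seq, background=UNIFORM):
    """Score `seq` as a sum of independent per-position log-odds terms."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match the motif length")
    return sum(math.log(col[c] / background[c]) for col, c in zip(pssm, seq))


if __name__ == "__main__":
    # Toy motif in which the two positions are perfectly correlated.
    sites = ["AA", "TT"] * 10
    pssm = learn_pssm(sites)
    # The PSSM cannot tell the observed "AA" from the never-observed "AT".
    print(math.isclose(log_odds(pssm, "AA"), log_odds(pssm, "AT")))
```

Because each position contributes its own term to the score, the toy motif above (only `AA` and `TT` ever occur) yields the same score for `AA` and `AT`. This is the kind of between-position dependency that the richer Bayesian network representations are meant to capture.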

### Citations

8080 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...cording to a background distribution. In models where we have a hidden variable T, parameter estimation is somewhat more complex. We need to perform an iterative procedure of Expectation Maximization [13, 28] to find a (local) maximum of the likelihood function. Structure Learning In addition to estimating parameters, we might also want to learn the dependency structure G, i.e., which edges to include. Wh...

2532 | An introduction to the bootstrap
- Efron, Tibshirani
- 1993
Citation Context ... challenge is how to relate these dependencies to protein structure and function. For this purpose, we need to be able to estimate our confidence in the discovered dependencies (e.g., using bootstrap [14, 19] or Bayesian methods [20]) and relate these dependencies with three dimensional conformations of Protein-DNA complexes. Acknowledgments We thank Doug Brutlag, Hillel Fleischer, Hanah Margalit, Tomer N...

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...learning in this case reduces to estimating marginal probability from D. Since we usually have a small number of training examples, we smooth the maximum likelihood estimates by using Dirichlet priors [23]. This amounts to adding a small number (5 in our experiments) of pseudo instances that are distributed according to a background distribution. In models where we have a hidden variable T, parameter e...

854 | A tutorial on learning with Bayesian networks
- Heckerman
- 1995
Citation Context ...quences. This is an instance of the well studied problem of learning Bayesian networks from data. We sketch the main issues without going into details. The interested reader can find more details in [6, 16, 17, 22]. We assume we have a training dataset D of M aligned binding sites. We denote by xi[m] the value of Xi at the m'th example. To clarify the discussion, it is conceptually easier to think of the input ...

637 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context ...g dependencies, while limiting the number of parameters to be at most 3 · 4K. Another important benefit of this class of models is that there are efficient algorithms to learn the best tree structure [12, 17]. Mixture of Trees In some cases, a tree structured network might be too limited. One possible approach of enriching the representation is to combine the benefits of a tree structure with the added ri...

588 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...g dependencies, while limiting the number of parameters to be at most 3 · 4K. Another important benefit of this class of models is that there are efficient algorithms to learn the best tree structure [12, 17]. Mixture of Trees In some cases, a tree structured network might be too limited. One possible approach of enriching the representation is to combine the benefits of a tree structure with the added ri...

526 | Fitting a mixture model by expectation maximization to discover motifs in biopolymers
- Bailey, Elkan
- 1994
Citation Context ...re co-expressed. In this case, the discovered motif indicates a possibly unknown factor that regulates the set of genes. Many works in recent years have proposed different schemes to handle this task [3, 4, 24, 31, 39, 40]. Both tasks require us to describe a motif that characterizes sequences that appear at binding sites of the transcription factor. The biological literature suggests that the relevant sequences are re...

515 | Systematic determination of genetic network architecture
- Tavazoie, Hughes, et al.
- 1999
Citation Context ...re co-expressed. In this case, the discovered motif indicates a possibly unknown factor that regulates the set of genes. Many works in recent years have proposed different schemes to handle this task [3, 4, 24, 31, 39, 40]. Both tasks require us to describe a motif that characterizes sequences that appear at binding sites of the transcription factor. The biological literature suggests that the relevant sequences are re...

441 | Transcriptional regulatory networks in Saccharomyces cerevisiae
- Lee, Rinaldi, et al.
- 2002
Citation Context ...els learned are more precise in predicting putative binding sites (in the sense of achieving a better false positives vs. false negatives tradeoff) using genome-wide S. cerevisiae localization assays [29]. 2. MODELING BINDING SITE MOTIFS We now consider how to model a sequence motif representing the binding sites of a transcription factor. We want to represent the commonalities among different binding...

300 | Genomewide location and function of DNA binding proteins
- Ren, Robert, et al.
- 2000
Citation Context ...ill appear in the co-regulated cluster. Similarly, we want the distribution to reflect that few non-regulated genes will appear in the cluster. A more interesting case involves ChIP localization data [29, 35, 38]. In this case the observation is a p-value that the sequence is enriched in the immunoprecipitation assay. A significant localization p-value is an indication that the sequence is bound by the assay's...

274 | Identifying DNA and protein patterns with statistically significant alignments of multiple sequences
- Hertz, Stormo
- 1999
Citation Context ...re co-expressed. In this case, the discovered motif indicates a possibly unknown factor that regulates the set of genes. Many works in recent years have proposed different schemes to handle this task [3, 4, 24, 31, 39, 40]. Both tasks require us to describe a motif that characterizes sequences that appear at binding sites of the transcription factor. The biological literature suggests that the relevant sequences are re...

272 | MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide data
- Quandt, Frech, et al.
- 1995
Citation Context ...a known transcription factor on a genomic scale. Here one uses examples of biologically verified binding sites and aims to find similar sites in other intergenic regions such as gene promoter regions [2, 34]. The second task is to discover a sequence motif as well as its putative sites in a collection of relatively long intergenic sequences that are suspected of being bound by the same factor. An example...

246 | Bioprospector: discovering conserved DNA motifs in upstream regulatory regions of coexpressed genes
- Liu, Brutlag, et al.
- 2001

239 | Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology
- Roth, Hughes, et al.
- 1998
Citation Context ...lected by the Church lab [25, 39]. These clusters of genes are based on functional annotations, co-expression, and known targets of transcription factors. They were originally analyzed using AlignACE [36]. This analysis included multiple runs of AlignACE, followed by filtering based on the quality of the motifs found. The best PSSMs were reported for each cluster. To gauge the quality of our baseline ...

234 | Learning Bayesian networks with local structure
- Friedman, Goldszmidt
- 1998
Citation Context ...f models as well as general unrestricted models. This can include representational extensions that are geared toward the complexity vs. expressiveness issue such as context specific dependency models [6, 18]. Second, as our framework made no particular assumption on the type of binding sites, it can be readily adapted to discover other sequence motifs such as those of splicing and histone remodeling fact...

220 | The Bayesian Structural EM Algorithm
- Friedman
- 1998
Citation Context ...quences. This is an instance of the well studied problem of learning Bayesian networks from data. We sketch the main issues without going into details. The interested reader can find more details in [6, 16, 17, 22]. We assume we have a training dataset D of M aligned binding sites. We denote by xi[m] the value of Xi at the m'th example. To clarify the discussion, it is conceptually easier to think of the input ...

216 | The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis
- Lauritzen
- 1995
Citation Context ...cording to a background distribution. In models where we have a hidden variable T, parameter estimation is somewhat more complex. We need to perform an iterative procedure of Expectation Maximization [13, 28] to find a (local) maximum of the likelihood function. Structure Learning In addition to estimating parameters, we might also want to learn the dependency structure G, i.e., which edges to include. Wh...

210 | Finding motifs using random projections
- Buhler, Tompa
- 2002
Citation Context ...our learning favors discriminative motifs, we use a simple and efficient variant of the algorithm described by Barash et al. [4]. This algorithm uses random projections of subsequences, as described by Buhler and Tompa [8]. Having chosen a random projection, we check whether it appears in each input sequence. Each of the projected K-mers is then scored by a hypergeometric p-value for enrichment in sequences with P(rsI O) > 0...

175 | Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables - Chickering, Heckerman - 1997

119 | Learning belief networks in the presence of missing values and hidden variables
- Friedman
- 1997
Citation Context ... it converges to a (local) maximum. This procedure is a form of hill climbing and is guaranteed to improve the likelihood at each iteration. The Structural Expectation Maximization (SEM) algorithm [15] generalizes this idea when we also learn structure. For models where the structure is fixed and we learn using maximum likelihood (PSSMs and mixtures of PSSMs), we define EM as progressing through a ...

115 | Combining evidence using p-values: application to sequence homology searches, Bioinformatics
- Bailey, Gribskov
- 1998
Citation Context ...a known transcription factor on a genomic scale. Here one uses examples of biologically verified binding sites and aims to find similar sites in other intergenic regions such as gene promoter regions [2, 34]. The second task is to discover a sequence motif as well as its putative sites in a collection of relatively long intergenic sequences that are suspected of being bound by the same factor. An example...

97 | Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 106: 697–708
- Simon, Barnett, et al.
- 2001
Citation Context ...ill appear in the co-regulated cluster. Similarly, we want the distribution to reflect that few non-regulated genes will appear in the cluster. A more interesting case involves ChIP localization data [29, 35, 38]. In this case the observation is a p-value that the sequence is enriched in the immunoprecipitation assay. A significant localization p-value is an indication that the sequence is bound by the assay's...

84 | Nucleotides of transcription factor binding sites exert inter-dependent effects on the binding affinities of transcription factors
- Bulyk, Johnson, et al.
- 2002
Citation Context ...ent of each other. It is an open question whether this strong independence assumption is reasonable. Recent results indicate that in specific cases, there might be dependence between positions (e.g., [1, 7, 9]). In this paper, we take a pragmatic approach to this issue. We aim to test whether modeling dependencies leads to better performance in the computational tasks of binding site annotation and motif d...

70 | A computational analysis of sequence features involved in recognition of short introns
- Lim, Burge
- 2001
Citation Context ...tifs. Agarwal and Bafna [1] suggested the tree network model, and discussed algorithms for learning it. In a related problem of modeling splice junctions, recent works examined k-order Markov models [30] and tree Bayesian networks [10]. These works learned models from aligned binding sites and used them to detect splice junctions in new sequences. Finally, Bayesian networks were used to model depende...

62 | Transcriptional regulatory networks
- Lee, Rinaldi, et al.
- 2002
Citation Context ...els learned are more precise in predicting putative binding sites (in the sense of achieving a better false positives vs. false negatives tradeoff) using genome-wide S. cerevisiae localization assays [29]. 2. MODELING BINDING SITE MOTIFS We now consider how to model a sequence motif representing the binding sites of a transcription factor. We want to represent the commonalities among different binding...

56 | From promoter sequence to expression: a probabilistic framework
- Segal, Barash, et al.
- 2002
Citation Context ...indication that the sequence is bound by the assay's target transcription factor. To model the dependence of the localization p-value on the R. attribute, we use the noisy sensor model of Segal et al. [37]. This model encodes that when the p-value is small, it is most likely generated by a regulated sequence. As the p-value grows, the probability given r t decays exponentially, and when the p-value is ...

49 | Data analysis with Bayesian networks: A bootstrap approach
- Friedman, Goldszmidt, et al.
- 1999
Citation Context ... challenge is how to relate these dependencies to protein structure and function. For this purpose, we need to be able to estimate our confidence in the discovered dependencies (e.g., using bootstrap [14, 19] or Bayesian methods [20]) and relate these dependencies with three dimensional conformations of Protein-DNA complexes. Acknowledgments We thank Doug Brutlag, Hillel Fleischer, Hanah Margalit, Tomer N...

49 | Context-specific Bayesian clustering for gene expression data
- Barash, Friedman
Citation Context ...quences. This is an instance of the well studied problem of learning Bayesian networks from data. We sketch the main issues without going into details. The interested reader can find more details in [6, 16, 17, 22]. We assume we have a training dataset D of M aligned binding sites. We denote by xi[m] the value of Xi at the m'th example. To clarify the discussion, it is conceptually easier to think of the input ...

47 | A simple hypergeometric approach for discovering putative transcription factor binding sites
- Barash, Bejerano, et al.
- 2001

46 | Mining for putative regulatory elements in the yeast genome using gene expression data
- Vilo, Brazma, et al.
- 2000

32 | Modeling splice sites with Bayes networks
- Cai, Delcher, et al.
- 2000
Citation Context ...gested the tree network model, and discussed algorithms for learning it. In a related problem of modeling splice junctions, recent works examined k-order Markov models [30] and tree Bayesian networks [10]. These works learned models from aligned binding sites and used them to detect splice junctions in new sequences. Finally, Bayesian networks were used to model dependencies between positions in prote...

28 | Eukaryotic transcription factors
- Latchman
- 1990
Citation Context ...actors modulate the expression of genes by binding to specific positions in nearby genomic regions. Transcription factors bind to specific DNA subsequences that can be pinpointed by biological assays [27]. Indeed, the TRANSFAC database *Contact author: nir@cs.huji.ac.il Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided t...

26 | Estimating dependency structure as a hidden variable
- Meila, Jordan, et al.
- 1997
Citation Context ...e multiplying the number of parameters only by a factor of C'. An important advantage of mixture of trees is that, similarly to trees, there exist efficient algorithms for learning the best structure [17, 32]. 3. LEARNING MOTIF MODELS 3.1 The Learning Setup Suppose we want to learn motif models from data. We assume that our input is a set of aligned binding sites of the transcription factor. Our task is t...

26 | The TRANSFAC system on gene expression regulation
- Wingender, Chen, et al.
Citation Context ...sh, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. RECOMB '03, April 10-13, 2003, Berlin, Germany. Copyright 2003 ACM 1-58113-635-8/03/0004 ...$5.00. [41] contains hundreds of biologically validated binding sites. Such assays, however, are labor intensive and cannot identify all the binding sites of a transcription factor. The recent availability of co...

24 | Being Bayesian about Bayesian network structure: A Bayesian approach to structure discovery in Bayesian networks
- Friedman, Koller
Citation Context ...hese dependencies to protein structure and function. For this purpose, we need to be able to estimate our confidence in the discovered dependencies (e.g., using bootstrap [14, 19] or Bayesian methods [20]) and relate these dependencies with three dimensional conformations of Protein-DNA complexes. Acknowledgments We thank Doug Brutlag, Hillel Fleischer, Hanah Margalit, Tomer Naveh, Dana Pe'er, Iramar...

13 | SAMIE: Statistical algorithm for modeling interaction energies
- Benos, Lapedes, et al.
- 2001
Citation Context ...ent of each other. It is an open question whether this strong independence assumption is reasonable. Recent results indicate that in specific cases, there might be dependence between positions (e.g., [1, 7, 9]). In this paper, we take a pragmatic approach to this issue. We aim to test whether modeling dependencies leads to better performance in the computational tasks of binding site annotation and motif d...

13 | Serial regulation of transcriptional regulators in the yeast cell cycle
- Simon, Barnett, et al.
Citation Context ...ill appear in the co-regulated cluster. Similarly, we want the distribution to reflect that few non-regulated genes will appear in the cluster. A more interesting case involves ChIP localization data [29, 35, 38]. In this case the observation is a p-value that the sequence is enriched in the immunoprecipitation assay. A significant localization p-value is an indication that the sequence is bound by the assay's...

12 | Genome-wide location and function of DNA binding proteins
- Ren, Robert, et al.
- 2000
Citation Context ...ill appear in the co-regulated cluster. Similarly, we want the distribution to reflect that few non-regulated genes will appear in the cluster. A more interesting case involves ChIP localization data [29, 35, 38]. In this case the observation is a p-value that the sequence is enriched in the immunoprecipitation assay. A significant localization p-value is an indication that the sequence is bound by the assay's...

4 | Detecting non-adjacent correlations within signals
- Agarwal, Bafna
- 1998
Citation Context ...ent of each other. It is an open question whether this strong independence assumption is reasonable. Recent results indicate that in specific cases, there might be dependence between positions (e.g., [1, 7, 9]). In this paper, we take a pragmatic approach to this issue. We aim to test whether modeling dependencies leads to better performance in the computational tasks of binding site annotation and motif d...

2 | Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae
- Hughes, Estep, et al.
- 2000
Citation Context ...retrieved out of all positive ones), Specificity (% of true positives retrieved), and the significance of the retrieved sequences, according to the hypergeometric p-value (called specificity score by [25]). This score is the probability of retrieving at least that many positive sequences in a random set of sequences of the same size. As we can see, when the data does not contain false positives sequenc...

1 | Detecting non-adjacent correlations within signals
- Agarwal, Bafna
- 1998

1 | Supplementary information for "Modeling dependencies in protein-DNA binding sites". http://www.cs.huji.ac.il/labs/compbio/TFBN - Barash, Elidan, et al.

1 | Context-specific Bayesian clustering for gene expression data
- Barash, Friedman
Citation Context ...quences. This is an instance of the well studied problem of learning Bayesian networks from data. We sketch the main issues without going into details. The interested reader can find more details in [6, 16, 17, 22]. We assume we have a training dataset D of M aligned binding sites. We denote by xi[m] the value of Xi at the m'th example. To clarify the discussion, it is conceptually easier to think of the inpu...

1 | Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors
- Bulyk, Johnson, Church

1 | Modeling splice sites with Bayes networks
- Cai, Delcher, et al.
Citation Context ...gested the tree network model, and discussed algorithms for learning it. In a related problem of modeling splice junctions, recent works examined k-order Markov models [30] and tree Bayesian networks [10]. These works learned models from aligned binding sites and used them to detect splice junctions in new sequences. Finally, Bayesian networks were used to model dependencies between positions in prote...

1 | Learning belief networks in the presence of missing values and hidden variables
- Friedman
- 1997
Citation Context ... it converges to a (local) maximum. This procedure is a form of hill climbing and is guaranteed to improve the likelihood at each iteration. The Structural Expectation Maximization (SEM) algorithm [15] generalizes this idea when we also learn structure. For models where the structure is fixed and we learn using maximum likelihood (PSSMs and mixtures of PSSMs), we define EM as progressing through a ...

1 | The Bayesian structural EM algorithm
- Friedman
- 1998
Citation Context ...quences. This is an instance of the well studied problem of learning Bayesian networks from data. We sketch the main issues without going into details. The interested reader can find more details in [6, 16, 17, 22]. We assume we have a training dataset D of M aligned binding sites. We denote by xi[m] the value of Xi at the m'th example. To clarify the discussion, it is conceptually easier to think of the inpu...

1 | A tutorial on learning with Bayesian networks
- Heckerman
- 1998
Citation Context ...quences. This is an instance of the well studied problem of learning Bayesian networks from data. We sketch the main issues without going into details. The interested reader can find more details in [6, 16, 17, 22]. We assume we have a training dataset D of M aligned binding sites. We denote by xi[m] the value of Xi at the m'th example. To clarify the discussion, it is conceptually easier to think of the inpu...

1 | Discovering structural correlations in α-helices - Klingler, Brutlag - 1994

1 | Bioprospector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes
- Liu, Brutlag, et al.