Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
- ALGORITHMS FOR MOLECULAR BIOLOGY
, 2007
Cited by 12 (1 self)
Background: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap.
Results: We developed and implemented an algorithm computing the p-value that s different motifs occur respectively k1, ..., ks or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA.
Method: The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs in $O\big(n\,|\Sigma|\,(m|\mathcal{H}| + K|\Sigma|^{K}) \prod_i k_i\big)$ time, where $n$ is the length of the text, $|\Sigma|$ is the alphabet size, $m$ is the maximal motif length, $|\mathcal{H}|$ is the total number of words in the motifs, $K$ is the order of the Markov model, and $k_i$ is the number of occurrences of the $i$th motif.
Conclusion: The primary objective of the program is to assess the likelihood that a given DNA segment is a CRM regulated by a known set of regulatory factors. The program can also be used to select an appropriate threshold for PWM scanning. Another application is assessing the similarity of different motifs.
Availability: Project web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/AhoPro/
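The Method paragraph above can be illustrated with a small sketch. This is not the AhoPro implementation: it is a simplified Python illustration that handles a single motif (given as a set of words) and a zero-order (i.i.d.) text model only. The function names `build_automaton` and `p_value` and the `probs` parameter are assumptions made for this example.

```python
def build_automaton(words, alphabet):
    """Prefix automaton in the spirit of Aho-Corasick (naive construction).

    States are all prefixes of the motif words. Reading a letter moves to
    the longest suffix of (state + letter) that is itself a prefix.
    """
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    states = sorted(prefixes, key=lambda p: (len(p), p))  # "" is state 0
    index = {p: i for i, p in enumerate(states)}
    delta = []
    for p in states:
        row = {}
        for a in alphabet:
            s = p + a
            while s not in index:  # drop leading letters until a prefix remains
                s = s[1:]
            row[a] = index[s]
        delta.append(row)
    # Number of motif words that end exactly when this state is entered.
    hits = [sum(p.endswith(w) for w in words) for p in states]
    return delta, hits


def p_value(words, k, n, probs):
    """P(motif occurs >= k times, overlaps allowed) in an i.i.d. text of length n.

    `probs` maps each letter to its probability (assumed to sum to 1).
    """
    alphabet = sorted(probs)
    delta, hits = build_automaton(set(words), alphabet)
    # dist[q][c]: probability of being in state q having seen c occurrences,
    # with c capped at k (k means "k or more").
    dist = [[0.0] * (k + 1) for _ in delta]
    dist[0][0] = 1.0
    for _ in range(n):
        nxt = [[0.0] * (k + 1) for _ in delta]
        for q, row in enumerate(dist):
            for c, pr in enumerate(row):
                if pr == 0.0:
                    continue
                for a in alphabet:
                    q2 = delta[q][a]
                    c2 = min(k, c + hits[q2])
                    nxt[q2][c2] += pr * probs[a]
        dist = nxt
    return sum(row[k] for row in dist)
```

For example, `p_value(["TATA"], 1, 100, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})` gives the probability of at least one occurrence of TATA in a uniform random text of length 100. Extending the sketch to several motifs would presumably replace the single counter `c` with one capped counter per motif, which is one way to read the product over $k_i$ in the stated complexity.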
H: An Experimental Comparison of PMSPrune and Other Algorithms for Motif Search
On Correlation Polynomials and Subword Complexity
Cited by 2 (0 self)
We consider words with letters from a q-ary alphabet A. The kth subword complexity of a word w ∈ A* is the number of distinct subwords of length k that appear as contiguous subwords of w. We analyze subword complexity from both combinatorial and probabilistic viewpoints. Our first main result is a precise analysis of the expected kth subword complexity of a randomly chosen word w ∈ A^n. Our other main result describes, for w ∈ A*, the degree to which one understands the set of all subwords of w, provided that one knows only the set of all subwords of some particular length k. Our methods rely upon a precise characterization of overlaps between words of length k. We use three kinds of correlation polynomials of words of length k: unweighted correlation polynomials; correlation polynomials associated to a Bernoulli source; and generalized multivariate correlation polynomials. We survey previously known results about such polynomials, and we also present some new results concerning correlation polynomials.
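The two basic objects in this abstract can be made concrete with a short sketch. It is only an illustration, not the paper's method: the function names are assumptions, the correlation polynomial shown is the standard unweighted autocorrelation of a word (coefficient i is 1 exactly when the word overlaps itself at shift i), and the expectation is computed by brute-force enumeration over a uniform source rather than by the paper's analysis.

```python
from itertools import product


def subword_complexity(w, k):
    """Number of distinct contiguous subwords of length k in w."""
    return len({w[i:i + k] for i in range(len(w) - k + 1)})


def autocorrelation(u):
    """Coefficients c_0..c_{m-1} of the unweighted autocorrelation polynomial.

    c_i = 1 if the suffix of u starting at position i equals the prefix of u
    of the same length, else 0.
    """
    m = len(u)
    return [int(u[i:] == u[:m - i]) for i in range(m)]


def expected_complexity(alphabet, n, k):
    """Exact expected k-th subword complexity of a uniform random word of length n.

    Brute force over all len(alphabet)**n words, so only usable for small n.
    """
    total = sum(subword_complexity("".join(t), k)
                for t in product(alphabet, repeat=n))
    return total / len(alphabet) ** n
```

For instance, `autocorrelation("abab")` marks shifts 0 and 2, reflecting that "abab" overlaps itself with period 2; such overlaps are what make occurrences of a word at nearby positions dependent.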
The average profile of suffix trees
- In The Fourth Workshop on Analytic Algorithmics and Combinatorics
, 2007
Cited by 1 (1 self)
The internal profile of a tree structure denotes the number of internal nodes found at a specific level of the tree. Similarly, the external profile denotes the number of leaves on a level. The profile is of great interest because of its intimate connection to many other parameters of trees. For instance, the depth, fill-up level, height, path length, shortest path, and size of trees can each be interpreted in terms of the profile. The current study is motivated by the work of Park et al. [22], which was a comprehensive study of the profile of tries constructed from independent strings (also, each string generated by a memoryless source). In the present paper, however, we consider suffix trees, which are constructed from suffixes of a common string. The dependency between
Markovian embeddings of general random strings
Cited by 1 (1 self)
Let A be a finite set and X a sequence of A-valued random variables. We do not assume any particular correlation structure between these random variables; in particular, X may be a non-Markovian sequence. An adapted embedding of X is a sequence of the form R(X1), R(X1, X2), R(X1, X2, X3), etc., where R is a transformation defined over finite-length sequences. In this extended abstract we characterize a wide class of adapted embeddings of X that result in a first-order homogeneous Markov chain. We show that any transformation R has a unique coarsest refinement R′ in this class such that R′(X1), R′(X1, X2), R′(X1, X2, X3), etc. is Markovian. (By refinement we mean that R′(u) = R′(v) implies R(u) = R(v), and by coarsest refinement we mean that R′ is a deterministic function of any other refinement of R in our class of transformations.) We propose a specific embedding, denoted R^X, which is particularly amenable to analyzing the occurrence of patterns described by regular expressions in X. A toy example of a non-Markovian sequence of 0’s and 1’s is analyzed thoroughly: discrete asymptotic distributions are established for the number of occurrences of a certain regular pattern in X1, ..., Xn as n → ∞, whereas a Gaussian asymptotic distribution is shown to apply for another regular pattern.
DNA motif elucidation using belief propagation
- Nucleic Acids Res
, 2013
Cited by 1 (1 self)
Protein-binding microarray (PBM) is a high-throughput platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure binding signal intensities of a protein to all the possible DNA k-mers (k = 8–10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe a new algorithm that uses Hidden Markov Models (HMMs) and can derive precise and multimodal motifs using belief propagations. We describe an HMM-based approach using belief propagations (kmerHMM), which accepts and preprocesses PBM probe raw data into median-binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagations. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness. In particular, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source codes are available at the authors’ websites: e.g.
LARGE DEVIATIONS AND FULL EDGEWORTH EXPANSIONS FOR FINITE MARKOV CHAINS WITH APPLICATIONS TO THE ANALYSIS OF GENOMIC SEQUENCES
, 2009
Abstract. To establish lists of words with unexpected frequencies in long sequences, for instance in a molecular biology context, one needs to quantify the exceptionality of families of word frequencies in random sequences. To this aim, we study large deviation probabilities of multidimensional word counts for Markov and hidden Markov models. More specifically, we compute local Edgeworth expansions of arbitrary degrees for multivariate partial sums of lattice valued functionals of finite Markov chains. This yields sharp approximations of the associated large deviation probabilities. We also provide detailed simulations. These exhibit in particular previously unreported periodic oscillations, for which we provide theoretical explanations.
Exploring String Patterns with Trees
, 2011
The combination of the fields of probability and combinatorics is currently an object of much research. However, not many undergraduates or lay people have the opportunity to see how these areas can work together. We present what we hope is an accessible introduction to the possibilities easily available to many more people through the use of many examples and understandable explanations. We introduce topics of generating functions and tree structures formed through both independent strings and suffixes, as well as how we can find correlation polynomials, expected values, second moments, and variances of the number of nodes in a tree using indicator functions. Then we show a higher order example that includes matrices as the basis for its generating functions. The study of this unique field has many applications in areas including data compression, computational biology with the human genome, and computer science with binary strings.