Results 1 - 8 of 8
Incremental Paradigms of Motif Discovery
- Journal of Computational Biology
, 2004
Cited by 25 (7 self)
We examine the problem of extracting maximal irredundant motifs from a string. A combinatorial argument poses a linear bound on the total number of such motifs, thereby opening the way to the quest for the fastest and most efficient methods of extraction. The basic paradigm explored here is that of iterated updates of the set of irredundant motifs in a string under consecutive unit symbol extensions of the string itself. This approach exposes novel characterizations for the base set of motifs in a string, hinged on notions of partial order. Such properties support the design of ad hoc data structures and constructs, and lead to an O(n³) time incremental discovery algorithm.
A basis of tiling motifs for generating repeated patterns and its complexity for higher quorum
- In B. Rovan and P. Vojtás, editors, Mathematical Foundations of Computer Science, volume 2747 of LNCS
, 2003
Extracting approximate patterns
, 2005
Cited by 9 (0 self)
In this paper, we define a family of patterns with don’t cares occurring in a text. We call them primitive patterns. The set of primitive patterns forms a basis for all the maximal patterns occurring in the text. The number of primitive patterns is smaller than that of other known bases. We present an incremental algorithm that computes the primitive patterns occurring at least q times in a text of length n, given the N primitive patterns occurring at least q − 1 times, in time O(|Σ|Nn² log n), where Σ is the alphabet. In the particular case where q = 2, the time complexity is only O(|Σ|n² log n). We also give an algorithm that decides whether a given pattern is primitive in a given text. These algorithms are generalized to take multiple sequences as input. Finally, since we show that the number of primitive patterns is exponential, we give a solution for reducing the number of patterns of interest by using scoring techniques.
Monotone scoring of patterns with mismatches
- In Proceedings of WABI 2004
, 2004
Cited by 5 (4 self)
We study the problem of extracting, from a given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent queries: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k²) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for probabilities and expected occurrences of a substring under unit increases in its length, allowed number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns.
Incremental Discovery of the Irredundant Motif Bases for all Suffixes of a String in O(|Σ|n² log n) Time
, 2007
Cited by 4 (0 self)
Compact bases formed by motifs called irredundant and capable of generating all other motifs in a sequence have been proposed in recent years and successfully tested in tasks of biosequence analysis and classification. Given a sequence s of n characters drawn from an alphabet Σ, the problem of extracting such a base from s had been previously solved in time O(n² log n log |Σ|) and O(|Σ|n² log² n log log n), respectively, through resort to the FFT-based string searching by Fischer and Paterson. More recently, a solution taking time O(|Σ|n²) without resort to the FFT was also proposed. In the present paper, the problem is considered of extracting the bases of all suffixes of a string incrementally. This problem was solved in previous work in time O(n³). A much faster incremental algorithm is described here, which takes time O(|Σ|n² log n). Although this algorithm also does not make use of the FFT, its performance is comparable to the one exhibited by the previous FFT-based algorithms computing only one base. The implicit representation of a single base requires O(n) space, whence for finite alphabets the proposed solution is within a log n factor from optimality.
Extracting Approximate Patterns (Extended Abstract)
, 2003
In a sequence, approximate patterns are exponential in number. In this paper, we present a new notion of basis for the patterns with don't cares occurring in a given text (sequence)...
Composite Pattern Discovery for PCR Application
We consider the problem of finding pairs of short patterns such that, in a given input sequence of length n, the distance between each pair’s patterns is at least α. The problem was introduced in [1] and is motivated by the optimization of multiplexed nested PCR. We study algorithms for the following two cases: the special case when the two patterns in the pair are required to have the same length, and the more general case when the patterns can have different lengths. For the first case we present an O(αn log log n) time and O(n) space algorithm, and for the general case we give an O(αn log n) time and O(n) space algorithm. The algorithms work for any alphabet size and use asymptotically less space than the algorithms presented in [1]. For alphabets of constant size we also give an O(n√n log² n) time algorithm for the general case. We demonstrate that the algorithms perform well in practice and present our findings for the human genome. In addition, we study an extended version of the problem where patterns in the pair occur at certain positions at a distance at most α, but do not occur α-close anywhere else in the input sequence.
Signature Limits: An Entire Map of Clone Features and their Discovery in Nearly Linear Time.
We address the problem of creating entire and complete maps of software code clones (copy features in data) in a corpus of binary artifacts of unknown provenance. We report on a practical methodology, which employs enhanced suffix data structures and partial orderings of clones to compute a compact representation of the most interesting clone features in data. The enumeration of clone features is useful for malware triage and prioritization when human exploration, testing and verification is the most costly factor. We further show that the enhanced arrays may be used for discovery of provenance relations in data and we introduce two distinct Jaccard similarity coefficients to measure code similarity in binary artifacts. We illustrate the use of these tools on real malware data including a retrodiction experiment for measuring and enumerating evidence supporting common provenance in Stuxnet and Duqu. The results indicate the practicality and efficacy of mapping completely the clone features in data.