## Optimal string mining under frequency constraints (2006)


Venue: | Closed Sets for Labeled Data?, PKDD, 2006 |

Citations: | 13 - 4 self |

### Citations

1126 | Algorithms on Strings, Trees and Sequences
- Gusfield
- 1997
Citation Context ... in $D_j$ as $S_{D_j}(\varphi) - C_{D_j}(\varphi)$. The formula for the C-numbers is given by (2), ... (Footnote 3: Note that the maximally repeated strings are exactly those strings that correspond to an internal node in the suffix tree [19] for t.) Algorithm 2: Extraction of all substrings satisfying p. Input: suffix array SA, lcp-array LCP, $C'_{D_j}$ as computed by Alg. 1 (all of size n), frequency-based predicate $p(\mathrm{supp}_{D_1}, \ldots,$ supp... |

835 | Suffix Arrays: A New Method for On-Line String Searches
- Manber, Myers
- 1993
Citation Context ...let t denote an arbitrary string of length n. Later, t will be formed from the input databases (fully explained in Sect. 3.1). Recall that $t_{i..j}$ is the substring from i to j. The suffix array SA (see [3, 4]) for t is used to describe the lexicographic order of t’s suffixes, in the sense that it “enumerates” the suffixes from the smallest to the largest. More formally, SA[1, n] is an array of integers s.... |
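
The suffix array described in this excerpt can be illustrated with a minimal sketch. It assumes the 1-based positions used in the excerpt and builds the array by plain sorting, which is far slower than the linear-time constructions the paper cites; the function name `suffix_array` is hypothetical and chosen only for illustration.

```python
def suffix_array(t: str) -> list[int]:
    """Return the 1-based start positions of t's suffixes in lexicographic order.

    Naive construction by sorting the suffixes directly. This is only a sketch
    for small inputs, not one of the linear-time methods mentioned in the text.
    """
    # Position i (1-based) corresponds to the Python slice t[i - 1:].
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


if __name__ == "__main__":
    t = "banana"
    for pos in suffix_array(t):
        # Print each suffix in lexicographic order with its start position.
        print(pos, t[pos - 1:])
```

Sorting whole suffixes keeps the sketch short; for long texts one of the dedicated construction algorithms would be the practical choice.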

343 | Efficient mining of emerging patterns: Discovering trends and differences
- Dong, Li
- 1999
Citation Context ... a string φ as $\mathrm{growth}_{D_2 \to D_1}(\varphi) := \frac{\mathrm{supp}(\varphi, D_1)}{\mathrm{supp}(\varphi, D_2)}$ if $\mathrm{supp}(\varphi, D_2) \neq 0$, and $\mathrm{growth}_{D_2 \to D_1}(\varphi) = \infty$ otherwise. The following definition is motivated by the problem of mining Emerging Patterns [12]: Problem 2. Given two databases $D_1$ and $D_2$ of strings over Σ, a support threshold $\rho_s$ ($1/|D_1| \le \rho_s \le 1$), and a minimum growth rate $\rho_g > 1$, the Emerging Substrings Mining Problem is to find all strings ... |
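
The growth-rate definition in this excerpt can be sketched as follows. The excerpt does not define supp, so the sketch assumes it is the fraction of strings in a database that contain φ as a substring, which is consistent with the stated bound 1/|D1| ≤ ρs; the function names `supp` and `growth` and the sample databases are hypothetical.

```python
import math


def supp(phi: str, db: list[str]) -> float:
    """Assumed relative support: fraction of strings in db containing phi."""
    # db is assumed to be non-empty.
    return sum(phi in s for s in db) / len(db)


def growth(phi: str, d1: list[str], d2: list[str]) -> float:
    """Growth rate from d2 to d1: supp in d1 over supp in d2, infinity if the latter is 0."""
    s2 = supp(phi, d2)
    return supp(phi, d1) / s2 if s2 != 0 else math.inf


if __name__ == "__main__":
    d1 = ["abab", "aab"]  # illustrative "positive" database
    d2 = ["bb", "ab"]     # illustrative "negative" database
    print(growth("ab", d1, d2))
    print(growth("aa", d1, d2))
```

This brute-force check scans every string for every candidate φ; it only illustrates the definition and is unrelated to the suffix-array-based method of the paper.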

207 | Replacing suffix trees with enhanced suffix arrays
- Abouelhoda
- 2004
Citation Context ...maximal interval in SA where all suffixes have a common prefix (namely φ), and (c) says that at least two of the suffixes in this interval differ after position |φ|. We refer the interested reader to [20] for a proof of this non-trivial result. From now on, we call (l, r) an lcp-interval representing string φ if it fulfills the conditions of Lemma 3. A child-interval of (l, r) is a maximal proper sub-... |

113 | Linear-time longest-common-prefix computation in suffix arrays and its applications
- Kasai, Lee, et al.
- 2006
Citation Context ...$t_{SA[i]..n}$, $t_{SA[i-1]..n}$) for all $1 < i \le n$, and LCP[1] = 0. That is, LCP contains the lengths of the longest common prefixes of t’s suffixes that are consecutive in lexicographic order. Kasai et al. [5] gave an algorithm to compute LCP in O(n) time, and Manzini [16] adapted this algorithm to use only one integer array. It can be argued that most of the LCP-values are small compared with the size of ... |
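
The linear-time LCP computation attributed to Kasai et al. in this excerpt can be sketched as below. The sketch uses 0-based indices, so `lcp[0] = 0` plays the role of LCP[1] = 0 in the excerpt, and it takes a precomputed suffix array as input; the function name `lcp_array` is hypothetical.

```python
def lcp_array(t: str, sa: list[int]) -> list[int]:
    """Kasai-style LCP computation (0-based sketch).

    sa[r] is the start of the r-th smallest suffix of t. The result satisfies
    lcp[0] = 0 and lcp[r] = length of the longest common prefix of the
    suffixes starting at sa[r - 1] and sa[r].
    """
    n = len(t)
    rank = [0] * n  # rank[i] = position of suffix i in the suffix array
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # lexicographic predecessor of suffix i
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


if __name__ == "__main__":
    t = "banana"
    sa = sorted(range(len(t)), key=lambda i: t[i:])  # naive suffix array
    print(sa, lcp_array(t, sa))
```

Processing suffixes in text order lets the match length h drop by at most one per step, which is what bounds the total work by O(n).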

102 | Efficient Linear Time Construction of Suffix Arrays
- Ko, Aluru, et al.
Citation Context ...ch builds on the databases D1 and D2 from Ex. 1. The suffix array for t can be computed in O(n) time, either indirectly by constructing a suffix tree for t, or directly with some recent methods, e.g. [13]. In practice, however, asymptotically slower algorithms [14, 15] have been shown to perform faster. The method in [15] has the further advantage that it uses only εn additional bytes of space, which ... |

84 | On economic construction of the transitive closure of a directed graph
- Arlazarov, Dinic, et al.
- 1970
Citation Context ...g n). Then each block is preprocessed such that a query that lies completely inside one block can be answered in constant time. This step is accomplished by applying the so-called Four-Russians-Trick [17] to the blocks (precomputation of all results for sufficiently small instances). A final step preprocesses the array such that queries that exactly span over several blocks can be answered efficiently... |

81 | Engineering a lightweight suffix array construction algorithm. Algorithmica
- Manzini, Ferragina
- 2004
Citation Context ...rray for t can be computed in O(n) time, either indirectly by constructing a suffix tree for t, or directly with some recent methods, e.g. [13]. In practice, however, asymptotically slower algorithms [14, 15] have been shown to perform faster. The method in [15] has the further advantage that it uses only εn additional bytes of space, which is close to optimal. Here, ε is a tunable parameter that determin... |

78 | Recursive star-tree parallel data structure
- Berkman, Vishkin
- 1993
Citation Context ...it fulfills the conditions of Lemma 3. A child-interval of (l, r) is a maximal proper sub-interval of (l, r) that represents a different string. E.g., in Fig. 1(a), the lcp-interval representing a is (6, 14), which has the child-intervals (8, 9) (representing aa) and (11, 14) (representing ab). Now, if (l, r) is the lcp-interval that represents φ, with Lemma 2 we see that $C_{D_j}(\varphi) = \sum_{l \le i \le r} C'_{D_j}[i] =$ ... |

69 | Color Set Size Problem with Application to String Matching
- Hui
- 1992
Citation Context ...tation of the correction terms in phase 3. Although the $C'_{D_j}$’s are represented by new arrays of size n, we call this step “labelling” because it is derived from the tree labelling technique by Hui [18]. We want $C'_{D_j}[i]$ to be equal to the number of lexicographically adjacent suffixes from the same string in $D_j$ that share a longest common prefix of length LCP[i]. More formally, $C'_{D_j}[i]$ equals t... |

44 | A theory of inductive query answering.
- Raedt, Jaeger, et al.
- 2002
Citation Context ...turn all strings $\varphi \in \Sigma^\star$ that satisfy $\min_i \le \mathrm{freq}(\varphi, D_i) \le \max_i$ for all $1 \le i \le m$. This well-known problem has been addressed by many authors using different solution strategies and data-structures ([10, 11, 2]), but none of these is optimal. Next, we consider a 2-class problem for a (usually positive) database $D_1$ and a (usually negative) database $D_2$. We define the growth-rate from $D_2$ to $D_1$ of a string φ as... |

44 | Two space saving tricks for linear time LCP computation
- Manzini
- 2004
Citation Context ...at is, LCP contains the lengths of the longest common prefixes of t’s suffixes that are consecutive in lexicographic order. Kasai et al. [5] gave an algorithm to compute LCP in O(n) time, and Manzini [16] adapted this algorithm to use only one integer array. It can be argued that most of the LCP-values are small compared with the size of the text and could thus be stored in less than n words, but we d... |

41 | New indices for text
- Gonnet, Baeza-Yates, et al.
- 1992
Citation Context ... linear-time algorithm for answering frequency-related mining queries (e.g., emerging substrings). Logically, the algorithm can be divided into three main phases: (1) Preprocessing, (2) Labelling, and (3) Extraction. The preprocessing step constructs all necessary data structures: the suffix- and lcp-array, and the preprocessing for RMQ. The labelling step does the principal work for a fast calculatio... |

34 | Theoretical and Practical Improvements on the RMQ-Problem, with Applications to LCA and LCE
- Fischer, Heun
- 2006
Citation Context ...called range minimum queries (RMQs). RMQs generalize the lcp table in the sense that the length of the longest common prefix can be answered for arbitrary suffixes. Taking advantage of recent results [6, 7], it is possible to answer RMQs in constant time. Another technical novelty is the solution to computing the frequency counts. The solution first determines the number of all occurrences (counting sev... |
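
How RMQs extend the lcp table to arbitrary suffixes can be sketched with a sparse table. This is a simpler, swapped-in technique with O(n log n) preprocessing and constant-time queries, not the method of [6, 7]; the class name `SparseTableRMQ` is hypothetical and indices are 0-based.

```python
class SparseTableRMQ:
    """Range minimum queries via a sparse table (illustrative sketch)."""

    def __init__(self, a: list[int]) -> None:
        # table[j][i] holds min(a[i : i + 2**j]).
        self.table = [list(a)]
        n = len(a)
        k = 1  # block length covered by the most recent level
        while 2 * k <= n:
            prev = self.table[-1]
            self.table.append(
                [min(prev[i], prev[i + k]) for i in range(n - 2 * k + 1)]
            )
            k *= 2

    def query(self, l: int, r: int) -> int:
        """Return min(a[l..r]) for 0 <= l <= r < len(a), both ends inclusive."""
        j = (r - l + 1).bit_length() - 1  # largest power of two fitting the range
        row = self.table[j]
        # Two overlapping blocks of length 2**j cover the whole range.
        return min(row[l], row[r - (1 << j) + 1])


if __name__ == "__main__":
    lcp = [0, 1, 3, 0, 0, 2]  # LCP array of "banana" (0-based)
    rmq = SparseTableRMQ(lcp)
    # LCP of the suffixes at suffix-array ranks p < q is min(lcp[p + 1 .. q]).
    print(rmq.query(2, 2))  # ranks 1 and 2: suffixes "ana" and "anana"
```

The overlap of the two blocks is harmless because taking a minimum is idempotent, which is why the query needs no loop.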

20 | A new representation for protein secondary structure prediction based on frequent patterns
- Birzele, Kramer
- 2006
Citation Context ...tational biology, the goal is to find interesting string or sequence patterns in data. Application areas are, among others, finding discriminative features for sequence classification or segmentation [1], discovering new binding motifs of transcription factors, or probe design [2]. In this paper, we focus on string mining under frequency constraints, i.e., predicates over patterns depending solely on... |

20 | An incomplex algorithm for fast suffix array construction
- Schurmann, Stoye
- 2005
Citation Context ...rray for t can be computed in O(n) time, either indirectly by constructing a suffix tree for t, or directly with some recent methods, e.g. [13]. In practice, however, asymptotically slower algorithms [14, 15] have been shown to perform faster. The method in [15] has the further advantage that it uses only εn additional bytes of space, which is close to optimal. Here, ε is a tunable parameter that determin... |

18 | An Efficient Algorithm for Mining String Databases Under Constraints. KDID 2004: 108-129
- Lee, De Raedt
- 2007
Citation Context ...turn all strings $\varphi \in \Sigma^\star$ that satisfy $\min_i \le \mathrm{freq}(\varphi, D_i) \le \max_i$ for all $1 \le i \le m$. This well-known problem has been addressed by many authors using different solution strategies and data-structures ([10, 11, 2]), but none of these is optimal. Next, we consider a 2-class problem for a (usually positive) database $D_1$ and a (usually negative) database $D_2$. We define the growth-rate from $D_2$ to $D_1$ of a string φ as... |

15 | On the complexity of finding emerging patterns.
- Wang, Zhao, et al.
- 2005
Citation Context ...e total length of all such significant strings (i.e., the output size). It is interesting to note that no optimality results are known for other pattern domains such as itemsets or graphs (see, e.g., [8]). While the focus of this paper lies on the algorithm and the theoretical result, we also implemented and tested the approach to show that it works in practice. In our experiments, we compared protei... |

12 | Mining emerging substrings
- Chan, Kao, et al.
- 2002
Citation Context ...of sequence data. The aim of the experiments was to mine all frequent substrings (Probl. 1, Sect. 2) and emerging substrings, respectively (Probl. 2). The only known algorithm for emerging substrings [9] runs in quadratic time, and is therefore not applicable. The experiments confirm that our approach works well in practice. In particular, most queries for emerging substrings can be answered in less ... |

8 | NEWT, a new taxonomy portal
- Phan, Pilbout, et al.
- 2003
Citation Context .... We used two datasets consisting of the primary structure of all protein data from human and mouse, which were obtained from Swissprot using the keywords HUMAN and MOUSE in the NEWT taxonomy browser [21]. The human dataset contained 57,020 proteins of total length ≈23MB, and the mouse dataset contained 50,680 proteins of total length ≈22MB. Because the implementation of the emerging-substring-miner fr... |

4 | Fast frequent string mining using suffix arrays
- Fischer, Heun, et al.
- 2005
Citation Context ...in data. Application areas are, among others, finding discriminative features for sequence classification or segmentation [1], discovering new binding motifs of transcription factors, or probe design [2]. In this paper, we focus on string mining under frequency constraints, i.e., predicates over patterns depending solely on the frequency of their occurrence in the data. This category encompasses comb... |