Results 1 - 10
of
30
Simple linear work suffix array construction
, 2003
"... Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more exp ..."
Abstract
-
Cited by 119 (6 self)
- Add to MetaCart
Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √ n], it runs in O(vn) time using O(n / √ v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
A Database Index to Large Biological Sequences
- In VLDB
, 2001
"... We present an approach to searching genetic DNA sequences using an adaptation of the suffix tree data structure deployed on the general purpose persistent Java platform, PJama. Our implementation technique is novel, in that it allows us to build suffix trees on disk for arbitrarily large sequences, ..."
Abstract
-
Cited by 50 (3 self)
- Add to MetaCart
We present an approach to searching genetic DNA sequences using an adaptation of the suffix tree data structure deployed on the general purpose persistent Java platform, PJama. Our implementation technique is novel, in that it allows us to build suffix trees on disk for arbitrarily large sequences, for instance for the longest human chromosome consisting of 263 million letters. We propose to use such indexes as an alternative to the current practice of serial scanning. We describe our tree creation algorithm, analyse the performance of our index, and discuss the interplay of the data structure with object store architectures. Early measurements are presented.
Compressed suffix trees with full functionality
- Theory of Computing Systems
"... We introduce new data structures for compressed suffix trees whose size are linear in the text size. The size is measured in bits; thus they occupy only O(n log |A|) bits for a text of length n on an alphabet A. This is a remarkable improvement on current suffix trees which require O(n log n) bits. ..."
Abstract
-
Cited by 45 (5 self)
- Add to MetaCart
We introduce new data structures for compressed suffix trees whose size are linear in the text size. The size is measured in bits; thus they occupy only O(n log |A|) bits for a text of length n on an alphabet A. This is a remarkable improvement on current suffix trees which require O(n log n) bits. Though some components of suffix trees have been compressed, there is no linear-size data structure for suffix trees with full functionality such as computing suffix links, string-depths and lowest common ancestors. The data structure proposed in this paper is the first one that has linear size and supports all operations efficiently. Any algorithm running on a suffix tree can also be executed on our compressed suffix trees with a slight slowdown of a factor of polylog(n). 1
Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space
- Journal of Computational Biology
, 2000
"... Statistical modeling of sequences is a central paradigm of machine learning that � nds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, � xed-memory Markov models. In practice, such automa ..."
Abstract
-
Cited by 35 (5 self)
- Add to MetaCart
Statistical modeling of sequences is a central paradigm of machine learning that � nds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, � xed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. Recently, D. Ron, Y. Singer, and N. Tishby built much more compact, tree-shaped variants of probabilistic automata under the assumption of an underlying Markov process of variable memory length. These variants, called Probabilistic Suf � x Trees (PSTs) were subsequently adapted by G. Bejerano and G. Yona and applied successfully to learning and prediction of protein families. The process of learning 2 the automaton from a given training set of sequences requires worst-case time, where is the total length of the sequences in and is the length of a longest substring of to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of characters may cost time 2 in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties: Learning the automaton, for any, takes time. Prediction of a string of symbols by the automaton takes time. Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and related ef � cient computation, which is a by-product possibly of more general interest. Key words: amnesic automata, probabilistic suf � x trees, variable memory Markovian models, protein families, protein classi � cation. 1
Fast lightweight suffix array construction and checking
- 14th Annual Symposium on Combinatorial Pattern Matching
, 2003
"... We describe an algorithm that, for any v 2 [2; n], constructs the suffix array of a string of length n in O(vn + n log n) time using O(v + n= p v) space in addition to the input (the string) and the output (the suffix array). By setting v = log n, we obtain an O(n log n) time algorithm using O n= p ..."
Abstract
-
Cited by 24 (5 self)
- Add to MetaCart
We describe an algorithm that, for any v 2 [2; n], constructs the suffix array of a string of length n in O(vn + n log n) time using O(v + n= p v) space in addition to the input (the string) and the output (the suffix array). By setting v = log n, we obtain an O(n log n) time algorithm using O n= p
Finding maximal pairs with bounded gap
- Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 1645 of Lecture Notes in Computer Science
, 1999
"... A pair in a string is the occurrence of the same substring twice. A pair is maximal if the two occurrences of the substring cannot be extended to the left and right without making them different. The gap of a pair is the number of characters between the two occurrences of the substring. In this pape ..."
Abstract
-
Cited by 19 (6 self)
- Add to MetaCart
A pair in a string is the occurrence of the same substring twice. A pair is maximal if the two occurrences of the substring cannot be extended to the left and right without making them different. The gap of a pair is the number of characters between the two occurrences of the substring. In this paper we present methods for finding all maximal pairs under various constraints on the gap. In a string of length n we can find all maximal pairs with gap in an upper and lower bounded interval in time O(n log n + z) where z is the number of reported pairs. If the upper bound is removed the time reduces to O(n+z). Since a tandem repeat is a pair where the gap is zero, our methods can be seen as a generalization of finding tandem repeats. The running time of our methods equals the running time of well known methods for finding tandem repeats.
On-Line Construction of Compact Directed Acyclic Word Graphs
, 2001
"... A Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient text indexing structure, that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as the construction ..."
Abstract
-
Cited by 12 (8 self)
- Add to MetaCart
A Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient text indexing structure, that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as the construction of a CDAWG for a set of strings.
Discovering best variable-length-don’t-care patterns
- IN: PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON DISCOVERY SCIENCE. VOLUME 2534 OF LNAI., SPRINGER-VERLAG
, 2002
"... A variable-length-don’t-care pattern (VLDC pattern) isanelement of set Π =(Σ ∪{⋆}) ∗ , where Σ is an alphabet and ⋆ is a wildcard matching any string in Σ ∗. Given two sets of strings, we consider the problem of finding the VLDC pattern that is the most common to one, and the least common to the o ..."
Abstract
-
Cited by 9 (8 self)
- Add to MetaCart
A variable-length-don’t-care pattern (VLDC pattern) isanelement of set Π =(Σ ∪{⋆}) ∗ , where Σ is an alphabet and ⋆ is a wildcard matching any string in Σ ∗. Given two sets of strings, we consider the problem of finding the VLDC pattern that is the most common to one, and the least common to the other. We present a practical algorithm to find such best VLDC patterns exactly, powerfully sped up by pruning heuristics. We introduce two versions of our algorithm: one employs a pattern matching machine (PMM) whereas the other does an index structure called the Wildcard Directed Acyclic Word Graph (WDAWG). In addition, we consider a more generalized problem of finding the best pair 〈q, k〉, where k is the window size that specifies the length of an occurrence of the VLDC pattern q matching a string w. We present three algorithms solving this problem with pruning heuristics, using the dynamic programming (DP), PMMs and WDAWGs, respectively. Although the two problems are NP-hard, we experimentally show that our algorithms run remarkably fast.
Solving the string statistics problem in time O(n log n)
- Proc. 29th International Colloquium on Automata, Languages, and Programming
, 2002
"... The string statistics problem consists of preprocessing a string of length n such that given a query pattern of length m, the maximum number of non-overlapping occurrences of the query pattern in the string can be reported efficiently... ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
The string statistics problem consists of preprocessing a string of length n such that given a query pattern of length m, the maximum number of non-overlapping occurrences of the query pattern in the string can be reported efficiently...

