Practical Compressed Suffix Trees
Cited by 20 (5 self)
The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different space-time tradeoffs. However, each of them has practicality problems regarding either space or time requirements. In this paper we implement a recent theoretical proposal and show it yields an extremely interesting structure that lies in between, offering both practical times and affordable space. The implementation of the theoretical proposal is by no means trivial and involves significant algorithm engineering.
Inducing suffix and LCP arrays in external memory
In Proc. ALENEX, 2013
Cited by 8 (0 self)
We consider full text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory, which utilizes an efficient EM priority queue and runs in sorting complexity. Practical tests show that this algorithm outperforms the previous best EM suffix sorter [Dementiev et al., JEA 2008] by a factor of about two in time and I/O volume. Our second contribution is to augment the first algorithm to also construct the array of longest common prefixes (LCPs). This yields the first EM construction algorithm for LCP arrays. The overhead in time and I/O volume for this extended algorithm over plain suffix array construction is roughly a factor of two. Our algorithms scale far beyond problem sizes previously considered in the literature (text size of 80 GiB using only 4 GiB of RAM in our experiments).
Inducing the LCP-array
In International Conference on Algorithms and Data Structures (WADS), 2011
A simple parallel Cartesian tree algorithm and its application to suffix tree construction
In Proceedings of the Workshop on Algorithm Engineering and Experiments, 2014
Cited by 6 (3 self)
We present a simple linear work and space, and polylogarithmic time parallel algorithm for generating multiway Cartesian trees. We show that bottom-up traversals of the multiway Cartesian tree on the interleaved suffix array and longest common prefix array of a string can be used to answer certain string queries. By adding downward pointers in the tree (e.g. using a hash table), we can also generate suffix trees from suffix arrays on arbitrary alphabets in the same bounds. In conjunction with parallel suffix array algorithms, such as the skew algorithm, this gives a rather simple linear work parallel, O(n^ε) time (0 < ε < 1), algorithm for generating suffix trees over an integer alphabet Σ ⊆ {1,...,n}, where n is the length of the input string. It also gives a linear work parallel algorithm requiring O(log² n) time with high probability for constant-sized alphabets. More generally, given a sorted sequence of strings and the longest common prefix lengths between adjacent elements, the algorithm will generate a patricia tree (compacted trie) over the strings. Of independent interest, we describe a work-efficient parallel algorithm for solving the all nearest smaller values problem using Cartesian trees, which is much simpler than the work-efficient parallel algorithm described in previous work. We also present experimental results comparing the performance of the algorithm to existing sequential implementations and a second parallel algorithm that we implement. We present comparisons for the
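The all nearest smaller values problem mentioned in this abstract can be illustrated with a short sketch. The code below is the textbook sequential stack-based method, not the parallel Cartesian-tree algorithm of the paper; the function names are hypothetical, and ties are resolved by requiring a strictly smaller value, which is an assumption.

```python
def nearest_smaller_to_left(values):
    """For each index, return the index of the nearest strictly smaller
    element to its left, or None if no such element exists."""
    result = []
    stack = []  # indices whose values are strictly increasing, bottom to top
    for i, v in enumerate(values):
        # Discard candidates that are not strictly smaller than v.
        while stack and values[stack[-1]] >= v:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(i)
    return result


def all_nearest_smaller_values(values):
    """Return (left, right) lists of nearest strictly smaller indices."""
    n = len(values)
    left = nearest_smaller_to_left(values)
    # Run the same scan on the reversed input, then map indices back.
    mirrored = nearest_smaller_to_left(values[::-1])
    right = [None if j is None else n - 1 - j for j in reversed(mirrored)]
    return left, right
```

Each index is pushed and popped at most once per scan, so the sketch runs in linear time sequentially.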
Fast and Lightweight LCP-Array Construction Algorithms
Cited by 6 (0 self)
The suffix tree is a very important data structure in string processing, but it suffers from huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (compressed) components: the suffix array, the LCP-array, and data structures for simulating navigational operations on the suffix tree. The LCP-array stores the lengths of the longest common prefixes of lexicographically adjacent suffixes, and it can be computed in linear time. In this paper, we present new LCP-array construction algorithms that are fast and very space efficient. In practice, our algorithms outperform the currently best algorithms on large inputs.
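To make the LCP-array definition concrete, here is a minimal sketch. It builds the suffix array by naive sorting and then computes the LCP-array with the classical linear-time scan of Kasai et al.; it is not one of the space-efficient algorithms this paper presents, and the function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by suffix content."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP-array via the Kasai et al. scan.

    lcp[r] is the length of the longest common prefix of the suffixes
    at ranks r - 1 and r in the suffix array; lcp[0] is 0 by convention.
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp
```

For "banana", the sorted suffixes are a, ana, anana, banana, na, nana, so adjacent pairs share 1, 3, 0, 0 and 2 leading characters respectively.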
Lightweight Lempel-Ziv parsing
2013
Cited by 4 (2 self)
We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d ≥ 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.
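For readers unfamiliar with LZ77 factorization, the following brute-force sketch shows what is being computed. It is a quadratic-or-worse illustration of the greedy parsing only and has nothing in common with the paper's O(n/d)-space method; the factor encoding (a `(source, length)` pair for a copy, `(char, 0)` for a new character) and the function names are assumptions made for this example.

```python
def lz77_factorize(text):
    """Greedy LZ77 factorization by brute force.

    Each factor is either (source, length), meaning a copy of `length`
    characters starting at an earlier position `source` (the copy may
    overlap the factor itself), or (char, 0) for a character that has
    no earlier occurrence.
    """
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            factors.append((text[i], 0))
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the text from the factors produced by lz77_factorize."""
    out = []
    for first, length in factors:
        if length == 0:
            out.append(first)
        else:
            for k in range(length):
                out.append(out[first + k])  # char-by-char, so overlap works
    return "".join(out)
```

Highly repetitive inputs collapse into few factors: "abababab" parses into two literal characters followed by a single self-overlapping copy of length six.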