Practical Compressed Suffix Trees
Cited by 20 (5 self)
The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different space-time tradeoffs. However, each of them has practicality problems regarding either space or time requirements. In this paper we implement a recent theoretical proposal and show it yields an extremely interesting structure that lies in between, offering both practical times and affordable space. The implementation of the theoretical proposal is by no means trivial and involves significant algorithm engineering.
Inducing suffix and LCP arrays in external memory
In Proc. ALENEX, 2013
Cited by 8 (0 self)
We consider full text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory, which utilizes an efficient EM priority queue and runs in sorting complexity. Practical tests show that this algorithm outperforms the previous best EM suffix sorter [Dementiev et al., JEA 2008] by a factor of about two in time and I/O volume. Our second contribution is to augment the first algorithm to also construct the array of longest common prefixes (LCPs). This yields the first EM construction algorithm for LCP arrays. The overhead in time and I/O volume for this extended algorithm over plain suffix array construction is roughly two. Our algorithms scale far beyond problem sizes previously considered in the literature (text size of 80 GiB using only 4 GiB of RAM in our experiments).
Inducing the LCP-Array
In International Conference on Algorithms and Data Structures (WADS), 2011
A simple parallel cartesian tree algorithm and its application to suffix tree construction
In Proceedings of the Workshop on Algorithm Engineering and Experiments, 2014
Cited by 6 (3 self)
We present a simple linear work and space, and polylogarithmic time parallel algorithm for generating multiway Cartesian trees. We show that bottom-up traversals of the multiway Cartesian tree on the interleaved suffix array and longest common prefix array of a string can be used to answer certain string queries. By adding downward pointers in the tree (e.g. using a hash table), we can also generate suffix trees from suffix arrays on arbitrary alphabets in the same bounds. In conjunction with parallel suffix array algorithms, such as the skew algorithm, this gives a rather simple linear work parallel, O(n^ε) time (0 < ε < 1), algorithm for generating suffix trees over an integer alphabet Σ ⊆ {1,...,n}, where n is the length of the input string. It also gives a linear work parallel algorithm requiring O(log^2 n) time with high probability for constant-sized alphabets. More generally, given a sorted sequence of strings and the longest common prefix lengths between adjacent elements, the algorithm will generate a patricia tree (compacted trie) over the strings. Of independent interest, we describe a work-efficient parallel algorithm for solving the all nearest smaller values problem using Cartesian trees, which is much simpler than the work-efficient parallel algorithm described in previous work. We also present experimental results comparing the performance of the algorithm to existing sequential implementations and a second parallel algorithm that we implement. We present comparisons for the
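For readers unfamiliar with the all nearest smaller values problem mentioned in this abstract, the following is a minimal sequential sketch using the textbook stack-based method. It is not the parallel, Cartesian-tree-based algorithm the paper describes; the function names are hypothetical and chosen only for illustration.

```python
def nearest_smaller_to_left(values):
    """For each index i, return the closest index j < i with
    values[j] < values[i], or -1 if no such index exists."""
    result = [-1] * len(values)
    stack = []  # indices whose values are strictly increasing
    for i, v in enumerate(values):
        # Discard candidates that are not strictly smaller than v.
        while stack and values[stack[-1]] >= v:
            stack.pop()
        if stack:
            result[i] = stack[-1]
        stack.append(i)
    return result


def all_nearest_smaller_values(values):
    """Return (left, right): nearest strictly smaller neighbours on
    both sides. Missing neighbours are reported as -1 and len(values)."""
    n = len(values)
    left = nearest_smaller_to_left(values)
    # Reuse the left scan on the reversed input, then map indices back.
    rev = nearest_smaller_to_left(values[::-1])
    right = [n if j == -1 else n - 1 - j for j in reversed(rev)]
    return left, right
```

Each index is pushed and popped at most once per scan, so this sequential version runs in linear time.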
Fast and Lightweight LCP-Array Construction Algorithms
Cited by 6 (0 self)
The suffix tree is a very important data structure in string processing, but it suffers from a huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (compressed) components: the suffix array, the LCP-array, and data structures for simulating navigational operations on the suffix tree. The LCP-array stores the lengths of the longest common prefixes of lexicographically adjacent suffixes, and it can be computed in linear time. In this paper, we present new LCP-array construction algorithms that are fast and very space efficient. In practice, our algorithms outperform the currently best algorithms on large inputs.
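To make the LCP-array definition above concrete, here is a minimal sketch of a linear-time construction from a given suffix array. It uses the well-known algorithm of Kasai et al., not the new space-efficient algorithms this paper presents. The suffix array helper is deliberately naive and is included only so the example is self-contained; both function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array by sorting suffixes (illustration only,
    far slower than real suffix sorters)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[r] = length of the longest common prefix of the suffixes
    at ranks r-1 and r in the suffix array; lcp[0] is 0."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # number of matched characters carried to the next suffix
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # lexicographic predecessor of suffix i
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # suffix i+1 shares at least h-1 characters
    return lcp
```

For the text "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so the suffix array is [5, 3, 1, 0, 4, 2] and the LCP-array is [0, 1, 3, 0, 0, 2].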
Lightweight Lempel-Ziv parsing
2013
Cited by 4 (2 self)
We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d ≥ 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.
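As background for this abstract, the sketch below shows what an LZ77 factorization is, using a naive quadratic-time search. It is not the O(n/d)-space algorithm the paper introduces, and the function names and factor representation (a literal character, or a (source, length) pair) are assumptions made for illustration.

```python
def lz77_factorize(text):
    """Greedy LZ77 factorization. Each factor is either a literal
    character (first occurrence) or a pair (source, length) that
    copies the longest earlier-starting match, overlaps allowed."""
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            factors.append(text[i])
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the text; copying one character at a time keeps
    self-overlapping factors correct."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            src, length = f
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)
```

For example, "abracadabra" parses into the factors a, b, r, (0, 1), c, (0, 1), d, (0, 4), and decoding those factors returns the original string.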