Simple linear work suffix array construction
, 2003
Cited by 149 (6 self)
Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
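To make the object being constructed concrete, here is a minimal Python sketch of a suffix array and a binary-search lookup on it. It is not DC3: it sorts suffixes with a plain comparison sort, which is far from linear time, and the function names `suffix_array` and `find_occurrences` are hypothetical names chosen for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text, in lexicographic order.

    Naive construction by comparison sort (roughly O(n^2 log n) in the worst
    case). This only illustrates the data structure; it is NOT the DC3 algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of pattern in text using binary search on sa."""
    m = len(pattern)

    # Lower bound: first rank whose suffix, truncated to m characters, is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first rank whose truncated suffix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All ranks in [start, lo) are suffixes that begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa)
    print(find_occurrences("banana", sa, "ana"))
```

The lookup relies only on the sorted order of suffixes, so any construction algorithm that produces the same array could be substituted for the naive sort.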
STXXL: Standard template library for XXL data sets
 In: Proc. of ESA 2005. Volume 3669 of LNCS
, 2005
Cited by 38 (5 self)
for processing huge data sets that can fit only on hard disks. It supports parallel disks, overlapping between disk I/O and computation and it is the first I/O-efficient algorithm library that supports the pipelining technique that can save more than half of the I/Os. STXXL has been applied both in academic and industrial environments for a range of problems including text processing, graph algorithms, computational geometry, Gaussian elimination, visualization, and analysis of microscopic images, differential cryptographic analysis, etc. The performance of STXXL and its applications is evaluated on synthetic and real-world inputs. We present the design of the library, how its performance features are supported, and demonstrate how the library integrates with STL. KEY WORDS: very large data sets; software library; C++ standard template library; algorithm engineering
Improving Suffix Array Locality for Fast Pattern Matching on Disk
, 2008
Cited by 8 (2 self)
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
Fast frequent string mining using suffix arrays
 IN: PROC. ICDM, IEEE COMPUTER SOCIETY
, 2005
TRELLIS+: AN EFFECTIVE APPROACH FOR INDEXING GENOME-SCALE SEQUENCES USING SUFFIX TREES
Cited by 3 (0 self)
With advances in high-throughput sequencing methods, and the corresponding exponential growth in sequence data, it has become critical to develop scalable data management techniques for sequence storage, retrieval and analysis. In this paper we present a novel disk-based suffix tree approach, called Trellis+, that effectively scales to massive amounts of sequence data using only a limited amount of main memory, based on a novel string buffering strategy. We show experimentally that Trellis+ outperforms existing suffix tree approaches; it is able to index genome-scale sequences (e.g., the entire Human genome), and it also allows rapid query processing over the disk-based index. Availability: TRELLIS+ source code is available online at
Building a parallel pipelined external memory algorithm library
 In 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS)
, 2009
Cited by 3 (1 self)
Large and fast hard disks for little money have enabled the processing of huge amounts of data on a single machine. For this purpose, the well-established STXXL library provides a framework for external memory algorithms with an easy-to-use interface. However, the clock speed of processors cannot keep up with the increasing bandwidth of parallel disks, making many algorithms actually compute-bound. To overcome this steadily worsening limitation, we exploit today's multicore processors with two new approaches. First, we parallelize the internal computation of the encapsulated external memory algorithms by utilizing the MCSTL library. Second, we augment the unique pipelining feature of the STXXL to enable automatic task parallelization. We show using synthetic and practical use cases that the combination of both techniques increases performance greatly.
Permuted longest-common-prefix array
 In Proc. 20th CPM, LNCS 5577
, 2009
Cited by 3 (0 self)
Abstract. The longest-common-prefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms including the fastest known LCP construction algorithm and some extremely space efficient algorithms. We also prove a new combinatorial property of the LCP values.
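The difference between position order and lexicographic order can be illustrated with a small Python sketch. It computes the PLCP array by scanning text positions left to right, using the well-known fact that the value at position i + 1 is at least the value at position i minus one, and then derives the LCP array by reading the PLCP values in suffix-array order. This is a sketch under stated assumptions, not necessarily any of the algorithms in the paper; the suffix array is built naively, and all function names are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    # Naive construction, used here only to have an input for the PLCP sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])


def plcp_array(text: str, sa: list[int]) -> list[int]:
    """plcp[i] = length of the longest common prefix of the suffix starting at
    position i and the suffix that precedes it in lexicographic order."""
    n = len(text)

    # phi[i] = start of the suffix ranked immediately before suffix i;
    # -1 marks the lexicographically smallest suffix, which has no predecessor.
    phi = [-1] * n
    for rank in range(1, n):
        phi[sa[rank]] = sa[rank - 1]

    plcp = [0] * n
    length = 0
    for i in range(n):  # position order, not lexicographic order
        j = phi[i]
        if j < 0:
            length = 0
        else:
            # Extend the match, reusing the part already known to agree.
            while i + length < n and j + length < n and text[i + length] == text[j + length]:
                length += 1
        plcp[i] = length
        length = max(length - 1, 0)
    return plcp


def lcp_from_plcp(plcp: list[int], sa: list[int]) -> list[int]:
    # Reading PLCP in suffix-array order yields the ordinary LCP array.
    return [plcp[pos] for pos in sa]


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    plcp = plcp_array(text, sa)
    print(plcp)
    print(lcp_from_plcp(plcp, sa))
```

Because `lcp_from_plcp` is a single indexed read per entry, the LCP array can also be simulated on demand as `plcp[sa[r]]` for a rank `r` instead of being stored.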
Space-efficient construction of Lempel-Ziv compressed text indexes
, 2009
Cited by 3 (1 self)
Abstract. A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Journal of Discrete Algorithms, 2004] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T[1..u] over an alphabet of size σ, the LZ-index requires 4uH_k(T) + o(u log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to texts that are not so large. In this paper we present a space-efficient algorithm to construct the LZ-index in O(u(log σ + log log u)) time and requiring 4uH_k(T) + o(u log σ) bits of space. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and outperforming by far the construction time of other compressed indexes. We also adapt our algorithm to construct some recent reduced versions of the LZ-index, showing that these can also be built without using extra space on top of that required by the final index. We study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index). We show how to build all the LZ-index alternatives in
AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures
Cited by 3 (0 self)
AS-Index is a new index structure for exact string search in disk resident databases. It uses hashing, unlike known alternatives, whether based on trees or tries. It typically indexes every n-gram in the database, though non-dense indexing is possible. The hash function uses the algebraic signatures of n-grams. Use of hashing provides for constant index access time for arbitrarily long patterns, unlike other structures whose search cost is at best logarithmic. The storage overhead of AS-Index is basically 500-600%, similar to that of alternatives or smaller. We show the index structure, our use of algebraic signatures and the search algorithm. We present the theoretical and experimental performance analysis. We compare the AS-Index to main alternatives. We conclude that AS-Index is an attractive structure and we indicate directions for future work.
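The general idea of indexing every n-gram in a hash table can be sketched in a few lines of Python. This is only an in-memory illustration under strong assumptions: it uses Python's built-in `dict` hashing rather than algebraic signatures, it ignores disk layout entirely, and the names `build_ngram_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_ngram_index(text: str, n: int) -> dict[str, list[int]]:
    """Map every n-gram of text to the list of positions where it starts (dense indexing)."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i)
    return index


def search(text: str, index: dict[str, list[int]], n: int, pattern: str) -> list[int]:
    """Exact search: one hash lookup on the pattern's first n-gram, then verification."""
    if len(pattern) < n:
        raise ValueError("pattern must be at least n characters long")
    # A single hash-table access yields the candidate positions,
    # regardless of how long the pattern is.
    candidates = index.get(pattern[:n], [])
    # Candidates share only the first n characters, so verify the rest against the text.
    return [i for i in candidates if text.startswith(pattern, i)]


if __name__ == "__main__":
    text = "abracadabra"
    index = build_ngram_index(text, 3)
    print(search(text, index, 3, "abra"))
```

In this sketch the verification step compares the full pattern against the text at each candidate position; a signature-based scheme would aim to make that check cheaper, which is outside what this illustration attempts.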
Lightweight data indexing and compression in external memory
 In Proc. 8th Latin American Symposium on Theoretical Informatics (LATIN)
, 2010
Cited by 3 (0 self)
Abstract. In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size n, they use only n bits of disk working space while all previous approaches use Θ(n log n) bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, thus they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses Θ(n) bits of working space, and a lightweight internal-memory algorithm for computing the BWT which is the fastest in the literature when the available working space is o(n) bits. Finally, we prove lower bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space × number of passes over the disk data.
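For readers unfamiliar with the transform itself, the following Python sketch computes the BWT of a short string and inverts it. It is a plain in-memory illustration, not the external-memory or lightweight algorithms described in the abstract: it sorts all suffixes naively, holds everything in RAM, and assumes a sentinel character (here `$`, an assumption of this sketch) that is smaller than every character of the input. The function names are hypothetical.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform of text + sentinel, via a naively sorted suffix array."""
    # Assumption of this sketch: the sentinel sorts before every text character.
    assert all(sentinel < c for c in text)
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character for a suffix is the character that precedes it in s
    # (s[-1] is the sentinel, which wraps around for the suffix at position 0).
    return "".join(s[i - 1] for i in sa)


def inverse_bwt(last: str) -> str:
    """Invert a BWT produced by bwt(); returns the text without the sentinel."""
    n = len(last)
    # A stable sort of positions by character reconstructs the first column.
    order = sorted(range(n), key=lambda i: last[i])
    # lf[i] = row whose first character is the same occurrence as last[i].
    lf = [0] * n
    for row, i in enumerate(order):
        lf[i] = row

    # Row 0 starts with the sentinel, so its last character is the final text
    # character; following lf walks the text backwards.
    out = []
    row = 0
    for _ in range(n - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The inversion jumps between rows in an order dictated by the data, which is the kind of random access that a scan-based external-memory method has to avoid.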