Results 1–10 of 14
Simple linear work suffix array construction
, 2003
Cited by 149 (6 self)
Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
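The object the abstract describes can be made concrete with a small sketch. The following is not the linear-time DC3 algorithm of the paper but a short prefix-doubling construction (O(n log² n)); it only illustrates what any suffix array construction computes, and the function name is our own.

```python
# Illustrative prefix-doubling suffix array construction, O(n log^2 n).
# NOT the paper's linear-time DC3 algorithm; shown only to make the
# output of suffix array construction concrete.

def suffix_array(s: str) -> list[int]:
    """Return the starting positions of all suffixes of s,
    sorted in lexicographic order of the suffixes."""
    n = len(s)
    if n < 2:
        return list(range(n))
    sa = list(range(n))
    rank = [ord(c) for c in s]
    k = 1
    while True:
        # Each suffix is keyed by its rank and the rank k positions later.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            break
        k *= 2
    return sa

print(suffix_array("banana"))   # [5, 3, 1, 0, 4, 2]
```

The six suffixes of "banana" in sorted order are a, ana, anana, banana, na, nana, hence the starting positions 5, 3, 1, 0, 4, 2.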
A taxonomy of suffix array construction algorithms
 ACM Computing Surveys
, 2007
Cited by 39 (10 self)
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms’ worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
Two space saving tricks for linear time LCP computation
, 2004
Cited by 31 (3 self)
Abstract. In this paper we consider the linear-time algorithm of Kasai et al. [6] for the computation of the Longest Common Prefix (LCP) array given the text and the suffix array. We show that this algorithm can be implemented without any auxiliary array in addition to the ones required for the input (the text and the suffix array) and the output (the LCP array). Thus, for a text of length n, we reduce the space occupancy of this algorithm from 13n bytes to 9n bytes. We also consider the problem of computing the LCP array by “overwriting” the suffix array. For this problem we propose an algorithm whose space occupancy can be bounded in terms of the empirical entropy of the input text. Experiments show that for linguistic texts our algorithm uses roughly 7n bytes. Our algorithm makes use of the Burrows–Wheeler Transform even if it does not represent any data in compressed form. To our knowledge this is the first application of the Burrows–Wheeler Transform outside the domain of data compression. The source code for the algorithms described in this paper has been included in the lightweight suffix sorting package [13], which is freely available under the GNU GPL.
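The Kasai et al. algorithm that the paper makes more space-frugal can be sketched as follows. Note that the auxiliary `rank` array below is exactly the kind of extra storage the paper's tricks eliminate.

```python
# Sketch of the Kasai et al. linear-time LCP computation.
# lcp[j] is the length of the longest common prefix of the suffixes
# starting at sa[j-1] and sa[j]; lcp[0] is 0 by convention.

def lcp_array(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    rank = [0] * n                  # inverse permutation of sa
    for j, i in enumerate(sa):
        rank[i] = j
    lcp = [0] * n
    h = 0
    for i in range(n):              # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]     # suffix just before suffix i in sa
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1              # key invariant: LCP drops by at most 1
        else:
            h = 0
    return lcp

sa = [5, 3, 1, 0, 4, 2]             # suffix array of "banana"
print(lcp_array("banana", sa))      # [0, 1, 3, 0, 0, 2]
```

The invariant that the LCP value can decrease by at most one when moving from suffix i to suffix i+1 in text order is what makes the total work linear.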
Better external memory suffix array construction
 In: Workshop on Algorithm Engineering & Experiments
, 2005
Cited by 29 (5 self)
Suffix arrays are a simple and powerful data structure for text processing that can be used for full-text indexes, data compression, and many other applications, in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents the design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on the average. Our implementation can construct suffix arrays for inputs of up to 4 GBytes in hours on a low-cost machine. As a tool of possible independent interest, we present a systematic way to design, analyze, and implement pipelined algorithms.
Picky: oligo microarray design for large genomes
 Bioinformatics
, 2004
Cited by 16 (0 self)
Motivation: Many large genomes are getting sequenced nowadays. Biologists are eager to start microarray analysis taking advantage of all known genes of a species, but existing microarray design tools were very inefficient for large genomes. Also, many existing tools operate in a batch mode that does not assure best designs. Results: PICKY is an efficient oligo microarray design tool for large genomes. PICKY integrates novel computer science techniques and the best known nearest-neighbor parameters to quickly identify sequence similarities and estimate their hybridization properties. Oligos designed by PICKY are computationally optimized to guarantee the best specificity, sensitivity and uniformity under the given design constraints. PICKY can be used to design arrays for whole genomes, or for only a subset of genes. The latter can still be screened against a whole genome to attain the same quality as a whole-genome array, thereby permitting low-budget, pathway-specific experiments to be conducted with large genomes. PICKY is the fastest oligo array design tool currently available to the public, requiring only a few hours to process large gene sets from rice, maize or human. Availability: PICKY is independent of any external software to execute, is designed for non-programmers to easily operate through a graphical user interface, and is made available for all major computing platforms (e.g., Mac, Windows and Linux) at
Fast BWT in small space by blockwise suffix sorting
 In Proc. DIMACS Working Group on the Burrows–Wheeler Transform: Ten Years Later
Cited by 14 (2 self)
The usual way to compute the Burrows–Wheeler transform (BWT) [3] of a text is by constructing the suffix array of the text. Even with space-efficient suffix array construction algorithms [12, 2], the space requirement of the suffix array itself is often the main factor limiting the size of the text that can be handled in one piece, which is crucial for constructing compressed text indexes [4, 5]. Typically, the suffix array needs 4n bytes while the text and the BWT need only n bytes each and sometimes even less, for example 2n bits each for a DNA sequence. We reduce the space dramatically by constructing the suffix array in blocks of lexicographically consecutive suffixes. Given such a block, the corresponding block of the BWT is trivial to compute. Theorem 1. The BWT of a text of length n can be computed in O(n log n + n√v + D_v) time (with high probability) and O(n/√v + v) space (in addition to the text and the BWT), for any v ∈ [1, n]. Here D_v = ∑_{i∈[0,n)} min(d_i, v) = O(nv), where d_i is the length of the shortest unique substring starting at i. Proof (sketch). Assume first that the text has no repetitions longer than v, i.e., d_i ≤ v for all i. Choose a set of O(v) random suffixes that divide the suffix array into blocks. The sizes of the blocks
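The "trivial" step the abstract mentions, deriving a BWT block from the corresponding suffix array block, looks like this for a whole suffix array. The `$` sentinel is the usual convention, not something the paper mandates.

```python
# Deriving the BWT from a suffix array: bwt[j] is the character that
# precedes suffix sa[j] in the text (cyclically, so the last character
# of the text when sa[j] == 0). The '$' sentinel is assumed to sort
# before every other character.

def bwt_from_sa(s: str, sa: list[int]) -> str:
    return "".join(s[i - 1] if i > 0 else s[-1] for i in sa)

text = "banana$"
sa = [6, 5, 3, 1, 0, 4, 2]          # suffix array of "banana$"
print(bwt_from_sa(text, sa))        # annb$aa
```

Because each output character depends only on one suffix array entry, any block of lexicographically consecutive suffixes yields the corresponding block of the BWT directly, which is the basis of the paper's blockwise construction.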
The engineering of a compression boosting library: Theory vs practice in BWT compression
 In Proc. 14th European Symposium on Algorithms (ESA ’06)
, 2006
Cited by 11 (6 self)
Abstract. Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows–Wheeler compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an “inefficient” part of the Burrows–Wheeler compression process. However, only recently two theoretically superior alternatives to Move-to-Front have been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first experimental comparison of these three techniques, giving a much needed methodological contribution to the current debate. We do so by providing a carefully engineered compression boosting library that can be used, on the one hand, to investigate the myriad new compression algorithms that can be based on boosting, and on the other hand, to make the first experimental assessment of how Move-to-Front behaves with respect to its recently proposed competitors. The main conclusion is that Boosting, Wavelet Trees and Move-to-Front yield quite close compression performance. Finally, our extensive experimental study of the boosting technique brings to light a new fact overlooked in 10 years of experiments in the area: a fast-adapting order-zero compressor is enough to provide state-of-the-art BWT compression by simply compressing the run-length encoded transform. In other words, Move-to-Front, Wavelet Trees, and Boosters can all be bypassed by a fast learner.
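Move-to-Front encoding, the stage whose efficiency the paper puts to experimental test, can be sketched in a few lines. This is only the textbook transform, not the paper's library.

```python
# Move-to-Front encoding. Each byte is replaced by its current index
# in a symbol table; the symbol is then moved to the front, so
# recently seen bytes encode as small numbers. Runs of equal symbols
# (typical of BWT output) become runs of zeros.

def mtf_encode(data: bytes) -> list[int]:
    table = list(range(256))        # initial alphabet order
    out = []
    for b in data:
        i = table.index(b)          # rank of b in the current table
        out.append(i)
        table.pop(i)
        table.insert(0, b)          # move b to the front
    return out

print(mtf_encode(b"aaabbb"))        # [97, 0, 0, 98, 0, 0]
```

The resulting skew toward small values is what makes a simple order-zero entropy coder effective after this stage.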
The Average Common Substring Approach to Phylogenomic Reconstruction
, 2005
Cited by 6 (0 self)
We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes, whose lengths may greatly vary. The core of our method is a new measure of pairwise distances between sequences. This measure is based on computing the average lengths of maximum common substrings. It is intrinsically related to information-theoretic tools (Kullback–Leibler relative entropy). We present an algorithm for efficiently computing these distances. In principle, the distance between two sequences of length ℓ can be calculated in O(ℓ) time. We implemented the algorithm, using suffix arrays. The implementation is fast enough to enable the construction of the proteome phylogenomic tree for hundreds of species, and the genome phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with “acceptable phylogenetic and taxonomic truth”. To assess our approach, it was compared to the traditional (single gene or protein based) maximum likelihood method, to implementations of a number of alternative approaches, including two that were previously published in the literature, and to the published results of a third approach. Comparing their outcome and running time to ours, using “traditional” trees and a standard tree comparison method, our algorithm improved upon the “competition” by a substantial margin. The simplicity and speed of our method allow for a whole-genome analysis with the greatest scope attempted so far. We describe here five different applications of the method, which not only show the validity of the method, but also suggest a number of novel phylogenetic insights.
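The core quantity, the average length of the longest common substring starting at each position, can be illustrated with a brute-force quadratic sketch. The paper computes this in linear time with suffix arrays; the function below only defines the measure, it is not their algorithm.

```python
# Brute-force illustration of the average common substring measure:
# for each position i of a, find the length of the longest substring
# of a starting at i that occurs anywhere in b, then average over i.
# Quadratic (or worse); the paper achieves O(l) with suffix arrays.

def avg_common_substring(a: str, b: str) -> float:
    total = 0
    for i in range(len(a)):
        best = 0
        for length in range(1, len(a) - i + 1):
            if a[i:i + length] in b:
                best = length       # extend the match while it occurs in b
            else:
                break
        total += best
    return total / len(a)

print(avg_common_substring("banana", "anan"))   # 13/6, about 2.167
```

Longer average matches indicate more shared sequence content, so this quantity (suitably normalized and symmetrized, as the paper describes) acts as a similarity score that can be turned into a pairwise distance.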
Antisequential Suffix Sorting for BWT-Based Data Compression
 IEEE Transactions on Computers
, 2005
Cited by 4 (0 self)
Abstract—Suffix sorting requires ordering all suffixes of all symbols in an input sequence and has applications in running queries on large texts and in universal lossless data compression based on the Burrows–Wheeler transform (BWT). We propose a new suffix lists data structure that leads to three fast, antisequential, and memory-efficient algorithms for suffix sorting. For a length-N input over a size-|X| alphabet, the worst-case complexities of these algorithms are Θ(N²), O(|X| N log(N/|X|)), and O(N √(|X| log(N/|X|))), respectively. Furthermore, simulation results indicate performance that is competitive with other suffix sorting methods. In contrast, the suffix sorting methods that are fastest on standard test corpora have poor worst-case performance. Therefore, in comparison with other suffix sorting methods, suffix lists offer a useful trade-off between practical performance and worst-case behavior. Another distinguishing feature of suffix lists is that these algorithms are simple; some of them can be implemented in VLSI. This could accelerate suffix sorting by at least an order of magnitude and enable high-speed BWT-based compression systems.
Space-time tradeoffs for longest-common-prefix array computation
 In Proc. 19th ISAAC
, 2008
Cited by 3 (1 self)
Abstract. The suffix array, a space-efficient alternative to the suffix tree, is an important data structure for string processing, enabling efficient and often optimal algorithms for pattern matching, data compression, repeat finding and many problems arising in computational biology. An essential augmentation to the suffix array for many of these tasks is the Longest Common Prefix (LCP) array. In particular, the LCP array allows one to simulate bottom-up and top-down traversals of the suffix tree with significantly less memory overhead (but in the same time bounds). Since 2001 the LCP array has been computable in Θ(n) time, but the algorithm (even after subsequent refinements) requires relatively large working memory. In this paper we describe a new algorithm that provides a continuous space-time tradeoff for LCP array construction, running in O(nv) time and requiring n + O(n/√v + v) bytes of working space, where v can be chosen to suit the available memory. Furthermore, the algorithm processes the suffix array, and outputs the LCP, strictly left-to-right, making it suitable for use with external memory. We show experimentally that for many naturally occurring strings our algorithm is faster than the linear-time algorithms, while using significantly less working memory.