Simple linear work suffix array construction
, 2003
Cited by 214 (6 self)
Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
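The difference cover idea named in this abstract can be illustrated with a small Python sketch. It is not the paper's C++ implementation: the function names are hypothetical, and the sketch only checks the defining property (every residue modulo v is a difference of two elements of D) and finds the shift that moves two positions into the sampled set, assuming the sample set {1, 2} modulo 3 that is usually associated with DC3.

```python
def is_difference_cover(D, v):
    """Return True if every residue mod v is a difference of two elements of D."""
    diffs = {(a - b) % v for a in D for b in D}
    return diffs == set(range(v))


def shift_into_cover(i, j, D, v):
    """Smallest l in [0, v) such that (i + l) mod v and (j + l) mod v are both in D.

    Returns None if no such shift exists (which cannot happen when D is a
    difference cover modulo v).
    """
    for l in range(v):
        if (i + l) % v in D and (j + l) % v in D:
            return l
    return None


if __name__ == "__main__":
    print(is_difference_cover({1, 2}, 3))        # sample set assumed for DC3
    print(is_difference_cover({0, 1, 3}, 7))     # a larger cover, modulo 7
    print(shift_into_cover(0, 3, {1, 2}, 3))     # shift that puts both positions in the sample
```

The shift matters because two arbitrary suffixes can then be compared by looking at no more than v characters and then consulting the already sorted sample suffixes.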
Engineering a lightweight suffix array construction algorithm (Extended Abstract)
Cited by 79 (3 self)
In this paper we consider the problem of computing the suffix array of a text T[1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, the interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the Burrows-Wheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its full-text index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in web-search engines [22]. In all these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and "lightweight" in the sense that it uses small space...
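To make the definition concrete, here is a minimal Python sketch of a suffix array built by directly sorting suffixes, together with the Burrows-Wheeler transform read off from it. This is the naive approach, not the lightweight algorithm of the paper; the function names are hypothetical, and the transform assumes the text ends with a unique sentinel character that sorts before all others.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction: sorts the suffixes directly, so it is far slower
    and more memory-hungry than the specialised algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Burrows-Wheeler transform derived from the suffix array.

    Assumes text ends with a unique sentinel (here '$') that is smaller
    than every other character; text[i - 1] wraps around to it for i == 0.
    """
    return "".join(text[i - 1] for i in suffix_array(text))


if __name__ == "__main__":
    print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
    print(bwt("banana$"))          # annb$aa
```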
A taxonomy of suffix array construction algorithms
 ACM Computing Surveys
, 2007
Cited by 76 (12 self)
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms’ worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
Linear-time construction of suffix arrays
 In Proc. 14th Symposium on Combinatorial Pattern Matching (CPM ’03)
, 2003
Cited by 68 (2 self)
Abstract. The time complexity of suffix tree construction has been shown to be equivalent to that of sorting: O(n) for a constant-size alphabet or an integer alphabet and O(n log n) for a general alphabet. However, previous algorithms for constructing suffix arrays have the time complexity of O(n log n) even for a constant-size alphabet. In this paper we present a linear-time algorithm to construct suffix arrays for integer alphabets, which does not use suffix trees as intermediate data structures during its construction. Since the case of a constant-size alphabet can be subsumed in that of an integer alphabet, our result implies that the time complexity of directly constructing suffix arrays matches that of constructing suffix trees.
Two space saving tricks for linear time LCP computation
, 2004
Cited by 44 (2 self)
Abstract. In this paper we consider the linear time algorithm of Kasai et al. [6] for the computation of the Longest Common Prefix (LCP) array given the text and the suffix array. We show that this algorithm can be implemented without any auxiliary array in addition to the ones required for the input (the text and the suffix array) and the output (the LCP array). Thus, for a text of length n, we reduce the space occupancy of this algorithm from 13n bytes to 9n bytes. We also consider the problem of computing the LCP array by “overwriting” the suffix array. For this problem we propose an algorithm whose space occupancy can be bounded in terms of the empirical entropy of the input text. Experiments show that for linguistic texts our algorithm uses roughly 7n bytes. Our algorithm makes use of the Burrows-Wheeler Transform even if it does not represent any data in compressed form. To our knowledge this is the first application of the Burrows-Wheeler Transform outside the domain of data compression. The source code for the algorithms described in this paper has been included in the lightweight suffix sorting package [13] which is freely available under the GNU GPL.
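For orientation, the baseline linear-time LCP computation of Kasai et al. can be sketched in Python as follows. This is the plain version that keeps an auxiliary rank array, i.e. the starting point the paper improves on, not its space-saving variants. The function names are hypothetical, and the suffix array is built naively only to keep the example self-contained.

```python
def suffix_array(text):
    """Naive suffix array, used here only to feed the LCP computation."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_kasai(text, sa):
    """LCP array: lcp[r] is the longest common prefix length of the suffixes
    at sa[r - 1] and sa[r]; lcp[0] is 0 by convention.

    Uses an auxiliary rank array (the inverse of sa).
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * n
    h = 0  # number of matched characters carried over from the previous suffix
    for i in range(n):
        r = rank[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]  # suffix that precedes suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(lcp_kasai("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

The algorithm runs in linear time because h decreases by at most one per step, so the total number of character comparisons stays proportional to n.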
Better external memory suffix array construction
 In: Workshop on Algorithm Engineering & Experiments
, 2005
Cited by 40 (4 self)
Suffix arrays are a simple and powerful data structure for text processing that can be used for full-text indexes, data compression, and many other applications, in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on the average. Our implementation can construct suffix arrays for inputs of up to 4 GBytes in hours on a low-cost machine. As a tool of possible independent interest we present a systematic way to design, analyze, and implement pipelined algorithms.
Fast BWT in small space by blockwise suffix sorting
 Theoretical Computer Science
Picky: oligo microarray design for large genomes
 Bioinformatics
, 2004
Cited by 27 (0 self)
Motivation: Many large genomes are getting sequenced nowadays. Biologists are eager to start microarray analysis taking advantage of all known genes of a species, but existing microarray design tools were very inefficient for large genomes. Also, many existing tools operate in a batch mode that does not assure best designs. Results: PICKY is an efficient oligo microarray design tool for large genomes. PICKY integrates novel computer science techniques and the best known nearest-neighbor parameters to quickly identify sequence similarities and estimate their hybridization properties. Oligos designed by PICKY are computationally optimized to guarantee the best specificity, sensitivity and uniformity under the given design constraints. PICKY can be used to design arrays for whole genomes, or for only a subset of genes. The latter can still be screened against a whole genome to attain the same quality as a whole genome array, thereby permitting low-budget, pathway-specific experiments to be conducted with large genomes. PICKY is the fastest oligo array design tool currently available to the public, requiring only a few hours to process large gene sets from rice, maize or human. Availability: PICKY is independent of any external software to execute, is designed for non-programmers to easily operate through a graphical user interface, and is made available for all major computing platforms (e.g., Mac, Windows and Linux) at
The Average Common Substring Approach to Phylogenomic Reconstruction
, 2005
Cited by 27 (0 self)
We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes, whose lengths may greatly vary. The core of our method is a new measure of pairwise distances between sequences. This measure is based on computing the average lengths of maximum common substrings. It is intrinsically related to information theoretic tools (Kullback-Leibler relative entropy). We present an algorithm for efficiently computing these distances. In principle, the distance of two sequences of length ℓ can be calculated in O(ℓ) time. We implemented the algorithm using suffix arrays. The implementation is fast enough to enable the construction of the proteome phylogenomic tree for hundreds of species, and the genome phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with “acceptable phylogenetic and taxonomic truth”. To assess our approach, it was compared to the traditional (single gene or protein based) maximum likelihood method. It was also compared to implementations of a number of alternative approaches, including two that were previously published in the literature, and to the published results of a third approach. Comparing their outcome and running time to ours, using “traditional” trees and a standard tree comparison method, our algorithm improved upon the “competition” by a substantial margin. The simplicity and speed of our method allow for a whole genome analysis with the greatest scope attempted so far. We describe here five different applications of the method, which not only show the validity of the method, but also suggest a number of novel phylogenetic insights.
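The quantity at the core of this measure, the average length of the longest substring starting at each position of one sequence that also occurs in the other, can be sketched naively in Python. This is only an illustration under assumptions: the function name is hypothetical, the scan is quadratic or worse rather than the suffix-array method the abstract mentions, and the paper's actual distance formula (which is not given in the abstract) is not reproduced.

```python
def average_common_substring_length(a, b):
    """Average, over start positions i in a, of the length of the longest
    substring a[i:i + k] that occurs somewhere in b.

    Naive scan for illustration only; returns 0.0 for an empty a.
    """
    if not a:
        return 0.0
    total = 0
    for i in range(len(a)):
        k = 0
        # Extend the match one character at a time while it still occurs in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        total += k
    return total / len(a)


if __name__ == "__main__":
    print(average_common_substring_length("abc", "bcd"))  # 1.0
    print(average_common_substring_length("abc", "abc"))  # 2.0
```

Larger averages indicate that the two sequences share longer stretches, which is the intuition the abstract ties to relative entropy.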