A taxonomy of suffix array construction algorithms
ACM Computing Surveys, 2007
Cited by 42 (10 self)
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms' worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
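To make the data structure in this abstract concrete, here is a minimal sketch of a suffix array and its use for pattern matching. It is not one of the surveyed construction algorithms: it simply sorts the suffixes directly, which is far slower than the methods the survey compares. The function names `build_suffix_array` and `find_occurrences` are illustrative assumptions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction by direct sorting (an assumption made for brevity,
    not one of the algorithms covered by the survey).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern` in `text`.

    Uses two binary searches over the suffix array `sa`: suffixes that start
    with `pattern` form one contiguous block in sorted order.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

For example, `find_occurrences("banana", build_suffix_array("banana"), "ana")` locates both overlapping occurrences of `ana`.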
Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE
Proc. CPM, volume 4009 of LNCS, 2006
Cited by 21 (9 self)
The Range-Minimum-Query-Problem is to preprocess an array such that the position of the minimum element between two specified indices can be obtained efficiently. We present a direct algorithm for the general RMQ-problem with linear preprocessing time and constant query time, without making use of any dynamic data structure. It consumes less than half of the space that is needed by the method by Berkman and Vishkin. We use our new algorithm for RMQ to improve on LCA-computation for binary trees, and further give a constant-time LCE-algorithm solely based on arrays. Both LCA and LCE have important applications, e.g., in computational biology. Experimental studies show that our new method is almost twice as fast in practice as previous approaches, and asymptotically slower variants of the constant-time algorithms perform even better for today's common problem sizes.
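The RMQ problem stated above can be illustrated with a sparse table. This is a swapped-in, simpler technique and not the paper's algorithm: it needs O(n log n) preprocessing time and space rather than linear, while still answering queries in constant time using only arrays. The names `build_sparse_table` and `rmq` are hypothetical.

```python
def build_sparse_table(values):
    """Precompute, for every block of length 2**j, the position of its minimum.

    table[j][i] is the position of the (leftmost) minimum of
    values[i : i + 2**j]. Sparse-table sketch, not the paper's
    linear-preprocessing method.
    """
    n = len(values)
    table = [list(range(n))]  # blocks of length 1: the minimum is the element itself
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        half = 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if values[left] <= values[right] else right)
        table.append(row)
        j += 1
    return table


def rmq(values, table, lo, hi):
    """Return the position of the leftmost minimum in values[lo..hi] (inclusive)."""
    j = (hi - lo + 1).bit_length() - 1  # largest power of two that fits in the range
    left = table[j][lo]
    right = table[j][hi - (1 << j) + 1]  # second block, possibly overlapping the first
    return left if values[left] <= values[right] else right
```

The query covers the range with two overlapping power-of-two blocks, which is valid because taking a minimum twice over the overlap does not change the result.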
The engineering of a compression boosting library: Theory vs practice in BWT compression
In Proc. 14th European Symposium on Algorithms (ESA ’06), 2006
Cited by 11 (6 self)
Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an “inefficient” part of the Burrows-Wheeler compression process. However, only recently two theoretically superior alternatives to Move-to-Front have been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first experimental comparison of these three techniques, giving a much needed methodological contribution to the current debate. We do so by providing a carefully engineered compression boosting library that can be used, on the one hand, to investigate the myriad new compression algorithms that can be based on boosting, and on the other hand, to make the first experimental assessment of how Move-to-Front behaves with respect to its recently proposed competitors. The main conclusion is that Boosting, Wavelet Trees and Move-to-Front yield quite close compression performance. Finally, our extensive experimental study of boosting technique brings to light a new fact overlooked in 10 years of experiments in the area: a fast adapting order-zero compressor is enough to provide state-of-the-art BWT compression by simply compressing the run-length encoded transform. In other words, Move-to-Front, Wavelet Trees, and Boosters can all be bypassed by a fast learner.
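As background for the pipeline this abstract discusses, the following is a minimal sketch of the Burrows-Wheeler transform followed by Move-to-Front encoding. It is a textbook illustration, not the paper's boosting library: the transform is computed by sorting all rotations (quadratic space), and the `$` end marker, the explicit `alphabet` parameter, and the function names are assumptions made for the example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform of `text` via sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts before every
    other symbol. Practical implementations use a suffix array instead.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(data, alphabet):
    """Encode `data` as positions in a list that moves each used symbol to the front.

    Runs of equal symbols, which the BWT tends to produce, become runs of zeros.
    """
    symbols = list(alphabet)
    encoded = []
    for ch in data:
        idx = symbols.index(ch)
        encoded.append(idx)
        symbols.insert(0, symbols.pop(idx))  # move the symbol to the front
    return encoded
```

A typical use would be `move_to_front(bwt(text), sorted(set(text + "$")))`, after which the integer sequence is passed to an order-zero entropy coder.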
Practical methods for constructing suffix trees
2005
Cited by 10 (1 self)
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not
Optimal string mining under frequency constraints
Closed Sets for Labeled Data?, PKDD, 2006
Cited by 10 (4 self)
We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings and strings that pass other statistical tests, e.g., the χ²-test. In contrast to the presented result for strings, no optimal algorithms are known for other pattern domains such as itemsets. The key to our approach are several recent results on index structures for strings, among them suffix and lcp-arrays, and a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
Improving Suffix Array Locality for Fast Pattern Matching on Disk
2008
Cited by 9 (2 self)
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
An analysis of the feasibility of short read sequencing
2005
Cited by 9 (0 self)
Several methods for ultra high-throughput DNA sequencing are currently under investigation. Many of these methods yield very short blocks of sequence information (reads). Here we report on an analysis showing the level of genome sequencing possible as a function of read length. It is shown that resequencing and de novo sequencing of the majority of a bacterial genome is possible with read lengths of 20–30 nt, and that reads of 50 nt can provide reconstructed contigs (a contiguous fragment of sequence data) of 1000 nt and greater that cover 80% of human chromosome 1.
Fast frequent string mining using suffix arrays
In: Proc. ICDM, IEEE Computer Society, 2005
On the Number of Elements to Reorder When Updating a Suffix Array
2011
Cited by 3 (2 self)
Recently new algorithms appeared for updating the Burrows-Wheeler transform or the suffix array, when the text they index is modified. These algorithms proceed by reordering entries and the number of such reordered entries may be as high as the length of the text. However, in practice, these algorithms are faster for updating the Burrows-Wheeler transform or the suffix array than the fastest reconstruction algorithms. In this article we focus on the number of elements to be reordered for real-life texts. We show that this number is related to LCP values and that, on average, L_ave entries are reordered, where L_ave denotes the average LCP value, defined as the average length of the longest common prefix between two consecutive sorted suffixes. Since we know little about the LCP distribution for real-life texts, we conduct experiments on a corpus that consists of DNA sequences and natural language texts. The results show that apart from texts containing large repetitions, the average LCP value is close to the one expected on a random text.
Space-time tradeoffs for longest-common-prefix array computation
In Proc. 19th ISAAC, 2008
Cited by 3 (1 self)
The suffix array, a space-efficient alternative to the suffix tree, is an important data structure for string processing, enabling efficient and often optimal algorithms for pattern matching, data compression, repeat finding and many problems arising in computational biology. An essential augmentation to the suffix array for many of these tasks is the Longest Common Prefix (LCP) array. In particular the LCP array allows one to simulate bottom-up and top-down traversals of the suffix tree with significantly less memory overhead (but in the same time bounds). Since 2001 the LCP array has been computable in Θ(n) time, but the algorithm (even after subsequent refinements) requires relatively large working memory. In this paper we describe a new algorithm that provides a continuous space-time tradeoff for LCP array construction, running in O(nv) time and requiring n + O(n/√v + v) bytes of working space, where v can be chosen to suit the available memory. Furthermore, the algorithm processes the suffix array, and outputs the LCP, strictly left-to-right, making it suitable for use with external memory. We show experimentally that for many naturally occurring strings our algorithm is faster than the linear time algorithms, while using significantly less working memory.
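For reference, here is a minimal sketch of linear-time LCP array construction in the style usually attributed to Kasai et al., which is presumably the 2001 algorithm the abstract alludes to (the abstract does not name it). It is not the paper's space-time tradeoff algorithm. The suffix array is built naively by sorting, and all function names are illustrative assumptions.

```python
def build_suffix_array(text):
    """Naive suffix array by direct sorting (assumption: kept simple for the example)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Return lcp, where lcp[k] is the longest common prefix length of the
    suffixes at sa[k - 1] and sa[k], and lcp[0] is 0.

    Processes suffixes in text order and reuses the previous match length
    minus one, so the total number of character comparisons is O(n).
    """
    n = len(text)
    rank = [0] * n  # rank[i] = position of suffix i in the suffix array
    for pos, suffix_start in enumerate(sa):
        rank[suffix_start] = pos

    lcp = [0] * n
    h = 0  # current match length carried over between iterations
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # dropping the first character loses at most one match
        else:
            h = 0
    return lcp
```

Note the extra `rank` array of n integers: this kind of working memory is what a space-efficient LCP algorithm aims to reduce.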