Results 1 - 10
of
10
Engineering a lightweight suffix array construction algorithm (Extended Abstract)
"... In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matchi ..."
Abstract
-
Cited by 57 (4 self)
- Add to MetaCart
In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, the interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the Burrows-Wheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its full-text index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in web-search engines [22]. In all these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and "lightweight" in the sense that it uses small space...
A taxonomy of suffix array construction algorithms
- ACM Computing Surveys
, 2007
"... In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abunda ..."
Abstract
-
Cited by 30 (10 self)
- Add to MetaCart
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms ’ worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
Implementing Radixsort
- ACM Jour. of Experimental Algorithmics
, 1998
"... We present and evaluate several new optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worst-case behavior. Our experimental results indicate that radix sorting is considerab ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
We present and evaluate several new optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worst-case behavior. Our experimental results indicate that radix sorting is considerably faster (often more than twice as fast) than comparison-based sorting methods. This is true even for small input sequences. We also show that it is possible to implement a radix sort with good worst-case running time without sacrificing average-case performance. Our implementations are competitive with the best previously published string sorting algorithms. Code, test data, and test results are available from the World Wide Web. 1. Introduction Radix sorting is a simple and very efficient sorting method that has received too little attention. A common misconception is that a radix sorting algorithm either has to inspect all the characters of the input or use an inordinate amount of extra...
Cache-efficient string sorting using copying
- In submission
, 2006
"... Abstract. Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
Abstract. Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort, and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings. 1
NTCIR-3 PAT Experiments at Osaka Kyoiku University - Long Gram-based Index and Essential Words
"... Long gram-based indices are experimented at NTCIR-3 patent task . To make gram-based indices, no analyses such as morphological ones are required. ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Long gram-based indices are experimented at NTCIR-3 patent task . To make gram-based indices, no analyses such as morphological ones are required.
Efficient Trie-Based Sorting of Large Sets of Strings
- Proceedings of the Australasian Computer Science Conference
, 2003
"... Sorting is a fundamental algorithmic task. Many general-purpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a trie-based algorithm for sorting stri ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Sorting is a fundamental algorithmic task. Many general-purpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a trie-based algorithm for sorting strings, is for large data sets more efficient than all previous algorithms for this task. In this paper we re-evaluate some of the implementation details of burstsort, in particular the method for managing buckets held at leaves. We show that better choice of data structures further improves the efficiency, at a small additional cost in memory. For sets of around 30,000,000 strings, our improved burstsort is nearly twice as fast as the previous best sorting algorithm.
Using random sampling to build approximate tries for efficient string sorting
- In Proc. International Workshop on Efficient and Experimental
"... Abstract. Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as p ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Abstract. Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained. 1
A Pragmatic Implemention of Monotone Priority Queues
- In DIMACS’96 implementation challenge
, 1996
"... Introduction Recently there have been several theoretical improvements in the area of sorting, priority queues, and searching [1, 2, 8, 9]. All these improvements use indirect addressing to surpass the comparison-based lower bounds. Inspired by these advances, and by the fact that algorithms based ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Introduction Recently there have been several theoretical improvements in the area of sorting, priority queues, and searching [1, 2, 8, 9]. All these improvements use indirect addressing to surpass the comparison-based lower bounds. Inspired by these advances, and by the fact that algorithms based on indirect addressing have proven to be efficient in many practical applications, we have implemented a trie-based priority queue as part of the DIMACS implementation challenge. Having Dijkstra's single source shortest path algorithm in mind, we decided to restrict our attention to monotone priority queues, as defined in [9]. A monotone priority queue, is a priority queue where the minimum is non-decreasing -- the minimum of an empty monotone priority queue is defined to be 0. The monotonicity condition is not a problem for greedy algorithms such as Dijkstra's single source shortest paths algorithm. Also, monotonicity is satisfied in event-simulations. According to t
Categories and Subject Descriptors: F.2.2 [Theory of Computation]: Nonnumerical Algorithms and Problems
"... We present and evaluate several optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worst-case behavior. Our experimental results indicate that radix sorting is considerably f ..."
Abstract
- Add to MetaCart
We present and evaluate several optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worst-case behavior. Our experimental results indicate that radix sorting is considerably faster (often more than twice as fast) than comparison-based sorting methods. This is true even for small input sequences. We also show that it is possible to implement a radixsort with good worst-case running time without sacrificing average-case performance. Our implementations are competitive with the best previously published string sorting programs.
An Efficient, Versatile . . .
, 2008
"... Sorting the suffixes of a string into lexicographical order is a fundamental task in a number of contexts, most notably lossless compression (Burrows–Wheeler transformation) and text indexing (suffix arrays). Most approaches to suffix sorting produce a sorted array of suffixes directly, continually ..."
Abstract
- Add to MetaCart
Sorting the suffixes of a string into lexicographical order is a fundamental task in a number of contexts, most notably lossless compression (Burrows–Wheeler transformation) and text indexing (suffix arrays). Most approaches to suffix sorting produce a sorted array of suffixes directly, continually moving suffixes into their final place in the array until the ordering is complete. In this article, we describe a novel and resource-efficient (time and memory) approach to suffix sorting, which works in a complementary way—by assigning each suffix its rank in the final ordering, before converting to a sorted array, if necessary, once all suffixes are ranked. We layer several powerful extensions on this basic idea and show experimentally that our approach is superior to other leading algorithms in a variety of real-world contexts.

