Results 1  10
of
20
Engineering a lightweight suffix array construction algorithm (Extended Abstract)
"... In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matchi ..."
Abstract

Cited by 80 (4 self)
 Add to MetaCart
(Show Context)
In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, the interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the BurrowsWheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its fulltext index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in websearch engines [22]. In all these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and "lightweight" in the sense that it uses small space...
A taxonomy of suffix array construction algorithms
 ACM Computing Surveys
, 2007
"... In 1990, Manber and Myers proposed suffix arrays as a spacesaving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abunda ..."
Abstract

Cited by 76 (12 self)
 Add to MetaCart
In 1990, Manber and Myers proposed suffix arrays as a spacesaving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple highlevel descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms ’ worstcase time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
CacheConscious Sorting of Large Sets of Strings with Dynamic Tries
"... Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms. The size of main memory in typical computers continues to grow, but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing stringso ..."
Abstract

Cited by 22 (5 self)
 Add to MetaCart
Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms. The size of main memory in typical computers continues to grow, but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing stringsorting algorithms as they do not utilise cache particularly well for large data sets. We propose a new sorting algorithm for strings, burstsort, based on dynamic construction of a compact trie in which strings are kept in buckets. It is simple, fast, and efficient. We experimentally compare burstsort to existing stringsorting algorithms on large and small sets of strings with a range of characteristics. These experiments show that, for large sets of strings, burstsort is almost twice as fast as any previous algorithm, due primarily to a lower rate of cache miss.
Implementing Radixsort
 ACM Jour. of Experimental Algorithmics
, 1998
"... We present and evaluate several new optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worstcase behavior. Our experimental results indicate that radix sorting is considerab ..."
Abstract

Cited by 21 (1 self)
 Add to MetaCart
(Show Context)
We present and evaluate several new optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worstcase behavior. Our experimental results indicate that radix sorting is considerably faster (often more than twice as fast) than comparisonbased sorting methods. This is true even for small input sequences. We also show that it is possible to implement a radix sort with good worstcase running time without sacrificing averagecase performance. Our implementations are competitive with the best previously published string sorting algorithms. Code, test data, and test results are available from the World Wide Web. 1. Introduction Radix sorting is a simple and very efficient sorting method that has received too little attention. A common misconception is that a radix sorting algorithm either has to inspect all the characters of the input or use an inordinate amount of extra...
Cacheefficient string sorting using copying
 In submission
, 2006
"... Abstract. Burstsort is a cacheoriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
Abstract. Burstsort is a cacheoriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in outofcache references. In this paper we introduce Cburstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort, and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CPburstsort, uses more memory but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings. 1
Efficient TrieBased Sorting of Large Sets of Strings
 Proceedings of the Australasian Computer Science Conference
, 2003
"... Sorting is a fundamental algorithmic task. Many generalpurpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a triebased algorithm for sorting stri ..."
Abstract

Cited by 5 (2 self)
 Add to MetaCart
Sorting is a fundamental algorithmic task. Many generalpurpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a triebased algorithm for sorting strings, is for large data sets more efficient than all previous algorithms for this task. In this paper we reevaluate some of the implementation details of burstsort, in particular the method for managing buckets held at leaves. We show that better choice of data structures further improves the efficiency, at a small additional cost in memory. For sets of around 30,000,000 strings, our improved burstsort is nearly twice as fast as the previous best sorting algorithm.
NTCIR3 PAT Experiments at Osaka Kyoiku University  Long Grambased Index and Essential Words
"... Long grambased indices are experimented at NTCIR3 patent task . To make grambased indices, no analyses such as morphological ones are required. ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
Long grambased indices are experimented at NTCIR3 patent task . To make grambased indices, no analyses such as morphological ones are required.
Using random sampling to build approximate tries for efficient string sorting
 In Proc. International Workshop on Efficient and Experimental
"... Abstract. Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as p ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
(Show Context)
Abstract. Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cacheefficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SRburstsort, DRburstsort, and DRLburstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained. 1
Can GPUs Sort Strings Efficiently?
"... Abstract—String sorting or variablelength key sorting has lagged in performance on the GPU even as the fixedlength key sorting has improved dramatically. Radix sorting is the fastest on the GPUs. In this paper, we present a fast and efficient string sort on the GPU that is built on the available r ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
(Show Context)
Abstract—String sorting or variablelength key sorting has lagged in performance on the GPU even as the fixedlength key sorting has improved dramatically. Radix sorting is the fastest on the GPUs. In this paper, we present a fast and efficient string sort on the GPU that is built on the available radix sort. Our method sorts strings from left to right in steps, moving only indexes and small prefixes for efficiency. We reduce the number of sort steps by adaptively consuming maximum string bytes based on the number of segments in each step. Performance is improved by using Thrust primitives for most steps and by removing singleton segments from consideration. Over 70 % of the string sort time is spent on Thrust primitives. This provides high performance along with high adaptability to future GPUs. We achieve speed of up to 10 over current GPU methods, especially on large datasets. We also scale to much larger input sizes. We present results on easy and difficult strings defined using their aftersort tie lengths. I.
A Pragmatic Implemention of Monotone Priority Queues
 In DIMACS’96 implementation challenge
, 1996
"... Introduction Recently there have been several theoretical improvements in the area of sorting, priority queues, and searching [1, 2, 8, 9]. All these improvements use indirect addressing to surpass the comparisonbased lower bounds. Inspired by these advances, and by the fact that algorithms based ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Introduction Recently there have been several theoretical improvements in the area of sorting, priority queues, and searching [1, 2, 8, 9]. All these improvements use indirect addressing to surpass the comparisonbased lower bounds. Inspired by these advances, and by the fact that algorithms based on indirect addressing have proven to be efficient in many practical applications, we have implemented a triebased priority queue as part of the DIMACS implementation challenge. Having Dijkstra's single source shortest path algorithm in mind, we decided to restrict our attention to monotone priority queues, as defined in [9]. A monotone priority queue, is a priority queue where the minimum is nondecreasing  the minimum of an empty monotone priority queue is defined to be 0. The monotonicity condition is not a problem for greedy algorithms such as Dijkstra's single source shortest paths algorithm. Also, monotonicity is satisfied in eventsimulations. According to t