Results 1-10 of 18
Engineering a lightweight suffix array construction algorithm (Extended Abstract)
Cited by 59 (4 self)
In this paper we consider the problem of computing the suffix array of a text T[1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or PAT array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, the interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the Burrows-Wheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its full-text index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in web-search engines [22]. In all these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and "lightweight" in the sense that it uses small space...
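
To make the definition concrete, here is a minimal sketch of a suffix array built by directly sorting the suffixes. It is a naive baseline for illustration only, not the lightweight algorithm the abstract describes. The function name `naive_suffix_array` is hypothetical, and positions are 0-based here, whereas the abstract indexes the text as T[1, n].

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of the suffixes of text in lexicographic order.

    Naive baseline: every suffix is materialized as a slice, so this uses
    quadratic extra space and is only suitable for short inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order: a, ana, anana, banana, na, nana
    print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The quadratic space taken by the slices is exactly the kind of overhead that motivates space-conscious construction algorithms.
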
A taxonomy of suffix array construction algorithms
ACM Computing Surveys, 2007
Cited by 42 (10 self)
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms' worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
Implementing Radixsort
ACM Jour. of Experimental Algorithmics, 1998
Cited by 20 (1 self)
We present and evaluate several new optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worst-case behavior. Our experimental results indicate that radix sorting is considerably faster (often more than twice as fast) than comparison-based sorting methods. This is true even for small input sequences. We also show that it is possible to implement a radix sort with good worst-case running time without sacrificing average-case performance. Our implementations are competitive with the best previously published string sorting algorithms. Code, test data, and test results are available from the World Wide Web. 1. Introduction Radix sorting is a simple and very efficient sorting method that has received too little attention. A common misconception is that a radix sorting algorithm either has to inspect all the characters of the input or use an inordinate amount of extra...
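
The point about not inspecting every character can be illustrated with a plain most-significant-digit (MSD) radix sort. This is a sketch of the textbook technique, not Forward radixsort or any implementation from the paper, and `msd_radix_sort` is a hypothetical name. Recursion stops as soon as a group holds a single string, so characters beyond the distinguishing prefix are never read.

```python
def msd_radix_sort(strings: list[str], depth: int = 0) -> list[str]:
    """Sort strings by distributing them on the character at position depth."""
    if len(strings) <= 1:
        # A group of one needs no further inspection of its remaining characters.
        return list(strings)

    finished = []                       # strings that end exactly at this depth
    buckets: dict[str, list[str]] = {}  # next character -> strings sharing it
    for s in strings:
        if len(s) == depth:
            finished.append(s)
        else:
            buckets.setdefault(s[depth], []).append(s)

    # Shorter strings come first, then each bucket in character order.
    result = finished
    for ch in sorted(buckets):
        result.extend(msd_radix_sort(buckets[ch], depth + 1))
    return result
```

Using a dictionary for the buckets keeps the sketch short. The recursion depth equals the longest shared prefix, so very long common prefixes would need an iterative variant.
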
Cache-efficient string sorting using copying
In submission, 2006
Cited by 8 (3 self)
Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort, and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings.
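
The trie-and-bucket idea can be sketched as follows. This is an illustrative toy in the spirit of the original burstsort, where buckets hold references to whole strings; it does not copy key tails as C-burstsort does, and it says nothing about cache behavior. The names `Node`, `insert`, `collect`, `burstsort`, and the threshold `BURST_LIMIT` are assumptions made for this sketch.

```python
BURST_LIMIT = 4  # hypothetical threshold; a real implementation would size buckets to fit in cache


class Node:
    def __init__(self) -> None:
        self.children: dict[str, "Node | list[str]"] = {}  # char -> subtrie or bucket
        self.ended: list[str] = []  # strings exhausted at this node


def insert(node: Node, s: str, depth: int = 0) -> None:
    while True:
        if depth == len(s):
            node.ended.append(s)
            return
        ch = s[depth]
        child = node.children.get(ch)
        if child is None:
            node.children[ch] = [s]  # start a new bucket
            return
        if isinstance(child, Node):
            node, depth = child, depth + 1  # descend one level
            continue
        child.append(s)
        if len(child) > BURST_LIMIT:
            # Burst: replace the full bucket with a trie node and redistribute.
            burst = Node()
            node.children[ch] = burst
            for t in child:
                insert(burst, t, depth + 1)
        return


def collect(node: Node, out: list[str]) -> None:
    out.extend(node.ended)
    for ch in sorted(node.children):
        child = node.children[ch]
        if isinstance(child, Node):
            collect(child, out)
        else:
            out.extend(sorted(child))  # small bucket: any sort will do


def burstsort(strings: list[str]) -> list[str]:
    root = Node()
    for s in strings:
        insert(root, s)
    out: list[str] = []
    collect(root, out)
    return out
```

A tail-copying variant would store `s[depth + 1:]` in each bucket instead of the whole string and rebuild keys from the trie path on output.
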
NTCIR-3 PAT Experiments at Osaka Kyoiku University: Long Gram-based Index and Essential Words
Cited by 3 (1 self)
Long gram-based indices are evaluated in the NTCIR-3 patent task. Building gram-based indices requires no linguistic analysis, such as morphological analysis.
Efficient Trie-Based Sorting of Large Sets of Strings
Proceedings of the Australasian Computer Science Conference, 2003
Cited by 3 (1 self)
Sorting is a fundamental algorithmic task. Many general-purpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a trie-based algorithm for sorting strings, is for large data sets more efficient than all previous algorithms for this task. In this paper we reevaluate some of the implementation details of burstsort, in particular the method for managing buckets held at leaves. We show that better choice of data structures further improves the efficiency, at a small additional cost in memory. For sets of around 30,000,000 strings, our improved burstsort is nearly twice as fast as the previous best sorting algorithm.
Efficient adaptive in-place radix sorting
Informatica, 2004
Cited by 1 (0 self)
This paper presents a new in-place pseudo linear radix sorting algorithm. The proposed algorithm, called MSL (Map Shuffle Loop), is an improvement over ARL (Maus, 2002). The ARL algorithm uses an in-place permutation loop of linear complexity in terms of input size. MSL uses a faster permutation loop searching for the next element to permute group by group, instead of element by element. The algorithm and its runtime behavior are discussed in detail. The performance of MSL is compared with quicksort and the fastest variant of radix sorting algorithms, which is the Least Significant Digit (LSD) radix sorting algorithm (Sedgewick, 2003).
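
For orientation, the LSD radix sort named as the comparison baseline can be sketched as below. This is the textbook counting-sort-per-digit scheme for non-negative integers, not MSL or ARL: it uses an auxiliary array on every pass and is therefore not in-place. The function name `lsd_radix_sort` and the 8-bit default digit width are assumptions.

```python
def lsd_radix_sort(values: list[int], bits_per_digit: int = 8) -> list[int]:
    """Sort non-negative integers with one stable counting pass per digit."""
    if not values:
        return []
    if min(values) < 0:
        raise ValueError("this sketch handles non-negative integers only")

    mask = (1 << bits_per_digit) - 1
    out = list(values)
    largest = max(out)
    shift = 0
    while (largest >> shift) > 0:
        # counts[d + 1] first holds how many keys have digit d ...
        counts = [0] * (mask + 2)
        for v in out:
            counts[((v >> shift) & mask) + 1] += 1
        # ... then the prefix sum turns counts[d] into the start offset of digit d.
        for d in range(mask + 1):
            counts[d + 1] += counts[d]
        placed = [0] * len(out)
        for v in out:  # scanning in order keeps the pass stable
            d = (v >> shift) & mask
            placed[counts[d]] = v
            counts[d] += 1
        out = placed
        shift += bits_per_digit
    return out
```

An in-place method would replace the `placed` buffer with a permutation loop that moves each element directly to its target group.
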
Using random sampling to build approximate tries for efficient string sorting
In Proc. International Workshop on Efficient and Experimental
Cited by 1 (1 self)
Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
A Pragmatic Implementation of Monotone Priority Queues
In DIMACS’96 implementation challenge, 1996
Cited by 1 (0 self)
Introduction. Recently there have been several theoretical improvements in the area of sorting, priority queues, and searching [1, 2, 8, 9]. All these improvements use indirect addressing to surpass the comparison-based lower bounds. Inspired by these advances, and by the fact that algorithms based on indirect addressing have proven to be efficient in many practical applications, we have implemented a trie-based priority queue as part of the DIMACS implementation challenge. Having Dijkstra's single source shortest path algorithm in mind, we decided to restrict our attention to monotone priority queues, as defined in [9]. A monotone priority queue is a priority queue where the minimum is nondecreasing; the minimum of an empty monotone priority queue is defined to be 0. The monotonicity condition is not a problem for greedy algorithms such as Dijkstra's single source shortest paths algorithm. Also, monotonicity is satisfied in event simulations. According to t
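
The monotonicity condition can be sketched with a thin wrapper around a binary heap. This is not the trie-based structure the paper implements, and the exact definition in [9] is not reproduced here. The sketch assumes one common reading: no inserted key may be smaller than the most recently extracted minimum, which starts at 0. The class name `MonotoneQueue` and its method names are hypothetical.

```python
import heapq


class MonotoneQueue:
    """Binary-heap stand-in that enforces a nondecreasing sequence of extracted minima."""

    def __init__(self) -> None:
        self._heap: list[int] = []
        self._floor = 0  # assumed reading: last extracted minimum, 0 before any extraction

    def insert(self, key: int) -> None:
        if key < self._floor:
            raise ValueError(f"key {key} is below the last extracted minimum {self._floor}")
        heapq.heappush(self._heap, key)

    def delete_min(self) -> int:
        key = heapq.heappop(self._heap)  # raises IndexError if the queue is empty
        self._floor = key
        return key

    def __len__(self) -> int:
        return len(self._heap)
```

Dijkstra's algorithm with non-negative edge weights satisfies this condition naturally, because every tentative distance it inserts is at least the distance of the vertex just extracted.
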
A Taxonomy of Suffix Array Construction Algorithms
1. INTRODUCTION Suffix arrays were introduced in 1990 [Manber and Myers 1990; 1993], along with algorithms for their construction and use as a space-saving alternative to suffix