Simple linear work suffix array construction
, 2003
Cited by 149 (6 self)
Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency, while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any $v \in [1, \sqrt{n}]$, it runs in $O(vn)$ time using $O(n/\sqrt{v})$ space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
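For readers unfamiliar with the data structure, the following minimal sketch shows what a suffix array contains. It sorts all suffixes by direct comparison, so it is not the linear-time DC3 algorithm from the abstract; the function name `naive_suffix_array` is a hypothetical choice made for this illustration.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Assumption: this is a plain comparison sort of the suffixes, used only to
    illustrate the output format. It is NOT the DC3 algorithm and is far from
    linear time on long inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order: a, ana, anana, banana, na, nana
    print(naive_suffix_array("banana"))
```

Any linear-time construction must produce the same array as this baseline, which makes the naive version a convenient correctness check on small inputs.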
Designing Efficient Sorting Algorithms for Manycore GPUs
, 2009
Cited by 55 (4 self)
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
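The abstract names parallel scan as a key primitive. One textbook way a scan can drive a radix sort is a stable one-bit "split" per pass, sketched below as a sequential Python model. Whether this matches the authors' CUDA design is an assumption; the function names and the one-bit-per-pass choice are hypothetical, and the sketch handles non-negative integers only.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] == sum(values[:i])."""
    return ([0] + list(accumulate(values))[:-1]) if values else []


def split_by_bit(keys, bit):
    """Stable partition: keys whose `bit` is 0 come first, then keys whose `bit` is 1."""
    zeros = [1 - ((k >> bit) & 1) for k in keys]
    zero_pos = exclusive_scan(zeros)  # destination of each 0-key
    total_zeros = sum(zeros)
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        if zeros[i]:
            out[zero_pos[i]] = k
        else:
            # i - zero_pos[i] is the number of 1-keys before position i
            out[total_zeros + i - zero_pos[i]] = k
    return out


def radix_sort(keys, num_bits=32):
    """Least-significant-bit-first radix sort built from repeated splits."""
    for bit in range(num_bits):
        keys = split_by_bit(keys, bit)
    return keys
```

In a data-parallel setting every iteration of the scatter loop is independent once the scan is available, which is why the scan is the step that needs a parallel implementation.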
Improved Parallel Integer Sorting without Concurrent Writing
, 1992
Cited by 41 (4 self)
We show that $n$ integers in the range $1..n$ can be sorted stably on an EREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ operations, for arbitrary given $t \ge \log n \log\log n$, and on a CREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n} + \log n / 2^{t/\log n}))$ operations, for arbitrary given $t \ge \log n$. In addition, we are able to sort $n$ arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost $\Theta(\sqrt{\log n})$ closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best previous sequential algorithm. We also show that $n$ integers in the range $1..m$ can be sorted in $O((\log n)^2)$ time with $O(n)$ operations on an EREW PRAM using a nonstandard word length of $O(\log n \log\log n \log m)$ bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear t...
Primitive Operations on the BSP Model
, 1996
Cited by 17 (14 self)
The design of a complex algorithm relies heavily on a set of primitive operations and the instruments required to compile these operations into an algorithm. In this work, we examine some of these basic primitive operations and present algorithms that are suitable for the Bulk-Synchronous Parallel model. In particular, we consider algorithms for the following primitive operations: broadcasting, parallel-prefix, merging, generalized and integer sorting. While our algorithms are fairly simple themselves, description of their performance in terms of the BSP parameters is somewhat complicated. The main reward for quantifying these complications is that it enables software to be written once and for all that can be migrated efficiently among a variety of parallel machines.
A Parallel Priority Queue with Constant Time Operations
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1998
Cited by 15 (1 self)
We present a parallel priority queue that supports the following operations in constant time: parallel insertion of a sequence of elements ordered according to key, parallel decrease key for a sequence of elements ordered according to key, deletion of the minimum key element, as well as deletion of an arbitrary element. Our data structure is the first to support multi insertion and multi decrease key in constant time. The priority queue can be implemented on the EREW PRAM, and can perform any sequence of $n$ operations in $O(n)$ time and $O(m \log n)$ work, $m$ being the total number of keys inserted and/or updated. A main application is a parallel implementation of Dijkstra's algorithm for the single-source shortest path problem, which runs in $O(n)$ time and $O(m \log n)$ work on a CREW PRAM on graphs with $n$ vertices and $m$ edges. This is a logarithmic factor improvement in the running time compared with previous approaches.
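To make the role of the priority queue in the named application concrete, here is a minimal sequential sketch of Dijkstra's algorithm using Python's `heapq`. It is not the parallel data structure from the abstract: `heapq` has no decrease-key operation, so this sketch assumes the common workaround of pushing a new entry and skipping stale ones. The adjacency-list format and the function name are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Single-source shortest paths for non-negative edge weights.

    `graph` maps each vertex to a list of (neighbor, weight) pairs.
    Returns a dict of shortest distances for every reachable vertex.
    """
    dist = {source: 0}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)  # delete-min
        if u in done:
            continue  # stale entry left behind by an earlier "decrease key"
        done.add(u)
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                # stands in for decrease-key: insert a fresh, smaller entry
                heapq.heappush(heap, (nd, v))
    return dist
```

The batch of relaxations performed for one vertex corresponds to the kind of multi-insert and multi-decrease-key request that the abstract's queue is designed to serve in constant time.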
An autotuned method for solving large tridiagonal systems on the GPU
 In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS)
, 2011
Cited by 9 (3 self)
Abstract. We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously, large tridiagonal systems could not be solved efficiently due to the limited size of on-chip shared memory. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage characteristic of our method, together with various workloads and GPUs of different capabilities, obligates an auto-tuning strategy to carefully select the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool that improves the performance by up to 5x, saves 17% and 32% of execution time on average over static and dynamic tuning, respectively, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver on many parallel tridiagonal systems by 6–11x. Keywords: GPU Computing, Auto-Tuning Algorithms, Tridiagonal systems.
Efficient Deterministic Sorting on the BSP Model
 PARALLEL PROCESSING LETTERS
, 1996
Cited by 8 (7 self)
We present a new algorithm for deterministic sorting on the Bulk-Synchronous Parallel (BSP) model of computation. We sort $n$ general keys using a partitioning scheme that achieves the requirements of efficiency (1-optimality) and insensitivity against data skew. Although we employ sampling in order to realize efficiency, we can give a precise worst-case estimation of the maximum imbalance which might occur. The algorithm is 1-optimal for a wide range of the BSP parameters in the sense that its speedup on $p$ processors is asymptotically $(1 - o(1))p$. Experimental results for the algorithm are also presented.
Language and library support for practical PRAM programming
 5TH EUROMICRO WORKSHOP ON PARALLEL AND DISTRIBUTED PROCESSING
, 1997
Cited by 4 (1 self)
We investigate the well-known PRAM model of parallel computation as a practical parallel programming model. The two components of this project are a general-purpose PRAM programming language called Fork95, and a library, called PAD, of efficient, basic parallel algorithms and data structures. We outline the primary features of Fork95 as they apply to the implementation of PAD. We give a brief overview of PAD and sketch the implementation of library routines for prefix sums and bucket sorting. Both language and library can be used with the SB-PRAM, an emulation of the PRAM in hardware.
Work-Time Optimal k-merge Algorithms on the PRAM
 IEEE Trans. on Parallel and Distributed Systems
, 1998
Cited by 2 (0 self)
The k-merge problem, given a collection of $k$ ($2 \le k \le n$) sorted sequences of total length $n$, asks to merge them into a new sorted sequence. The main contribution of this work is to propose simple and intuitive work-time optimal algorithms for the k-merge problem on two PRAM models. Specifically, our k-merge algorithms perform $O(n \log k)$ work and run in $O(\log n)$ time on the EREW-PRAM and in $O(\log\log n + \log k)$ time on the CREW-PRAM, respectively.
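The k-merge problem itself can be illustrated with a minimal sequential sketch that uses a binary heap holding at most one entry per input sequence, which gives O(n log k) comparisons, the same work bound the abstract quotes. This is a standard sequential technique and not the PRAM algorithms of the paper; the function name `k_merge` is hypothetical.

```python
import heapq


def k_merge(sequences):
    """Merge already-sorted lists into one sorted list.

    The heap holds at most one (value, sequence index, position) entry per
    input sequence, so each of the n output elements costs O(log k).
    """
    heap = [(seq[0], idx, 0) for idx, seq in enumerate(sequences) if seq]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        merged.append(value)
        nxt = pos + 1
        if nxt < len(sequences[idx]):
            # refill the heap from the sequence we just consumed from
            heapq.heappush(heap, (sequences[idx][nxt], idx, nxt))
    return merged


if __name__ == "__main__":
    print(k_merge([[1, 4, 7], [2, 5], [3, 6, 8, 9]]))
```

Including the sequence index in each heap entry breaks ties between equal values, so the values themselves only need to support comparison with each other.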