Results 1–10 of 27
Fast Parallel GPU-Sorting Using a Hybrid Algorithm
"... Abstract — This paper presents an algorithm for fast sorting of large lists using modern GPUs. The method achieves high speed by efficiently utilizing the parallelism of the GPU throughout the whole algorithm. Initially, a parallel bucketsort splits the list into enough sublists then to be sorted in ..."
Cited by 21 (1 self)
Abstract — This paper presents an algorithm for fast sorting of large lists using modern GPUs. The method achieves high speed by efficiently utilizing the parallelism of the GPU throughout the whole algorithm. Initially, a parallel bucketsort splits the list into enough sublists to then be sorted in parallel using mergesort. The parallel bucketsort, implemented in NVIDIA's CUDA, utilizes the synchronization mechanisms, such as atomic increment, that are available on modern GPUs. The mergesort requires scattered writing, which is exposed by CUDA and ATI's Data Parallel Virtual Machine [1]. For lists with more than 512k elements, the algorithm performs better than the bitonic sort algorithms, which have been considered to be the fastest for GPU sorting, and is more than twice as fast for 8M elements. It is 6-14 times faster than single-CPU quicksort for 1-8M elements, respectively. In addition, the new GPU algorithm sorts in O(n log n) time, as opposed to the standard O(n (log n)^2) for bitonic sort. Recently, it was shown how to implement GPU-based radix sort, of complexity n log n, to outperform bitonic sort. That algorithm is, however, still up to ~40% slower for 8M elements than the hybrid algorithm presented in this paper. GPU sorting is memory bound, and a key to the high performance is that the mergesort works on groups of four float values to lower the number of memory fetches. Finally, we demonstrate the performance on sorting vertex distances for two large 3D models; a key to, for instance, achieving correct transparency.
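The two-phase structure the abstract describes (bucket split, then independent per-bucket sorts) can be sketched sequentially. This is a minimal illustration only: the bucket count, the uniform split of the value range, and Python's built-in sort (standing in for the parallel mergesort) are simplifying assumptions, not the paper's implementation.

```python
def hybrid_sort(xs, buckets=4):
    """Sequential sketch of the hybrid idea: bucket split, then per-bucket sort."""
    if not xs:
        return []
    lo, hi = min(xs), max(xs)
    if lo == hi:
        return list(xs)
    width = (hi - lo) / buckets
    bins = [[] for _ in range(buckets)]
    for x in xs:  # on the GPU this scatter step uses atomic increments
        bins[min(int((x - lo) / width), buckets - 1)].append(x)
    out = []
    for b in bins:  # each sublist would be mergesorted in parallel on the GPU
        out.extend(sorted(b))
    return out
```

Because the buckets cover disjoint, ordered value ranges, concatenating the sorted buckets yields a fully sorted list.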
GPU Sample Sort
, 2009
"... In this paper, we present the design of a sample sort algorithm for manycore GPUs. Despite being one of the most efficient comparisonbased sorting algorithms for distributed memory architectures its performance on GPUs was previously unknown. For uniformly distributed keys our sample sort is at lea ..."
Cited by 13 (0 self)
In this paper, we present the design of a sample sort algorithm for many-core GPUs. Despite being one of the most efficient comparison-based sorting algorithms for distributed-memory architectures, its performance on GPUs was previously unknown. For uniformly distributed keys our sample sort is at least 25% and on average 68% faster than the best comparison-based sorting algorithm, GPU Thrust merge sort, and on average more than 2 times faster than GPU quicksort. Moreover, for 64-bit integer keys it is at least 63% and on average 2 times faster than the highly optimized GPU Thrust radix sort that directly manipulates the binary representation of keys. Our implementation is robust to different distributions and entropy levels of keys and scales almost linearly with the input size. These results indicate that multi-way techniques in general, and sample sort in particular, achieve substantially better performance than two-way merge sort and quicksort.
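The multi-way splitting at the heart of sample sort can be sketched as follows; the parameter values are illustrative assumptions, not the paper's tuned GPU kernels.

```python
import bisect
import random

def sample_sort(xs, k=4, oversample=8, cutoff=32):
    """Multi-way sample sort sketch: pick k-1 splitters from a random
    sample, scatter keys into k ordered buckets, and recurse per bucket."""
    if len(xs) <= cutoff:
        return sorted(xs)
    sample = sorted(random.sample(xs, min(len(xs), k * oversample)))
    splitters = [sample[i * len(sample) // k] for i in range(1, k)]
    buckets = [[] for _ in range(k)]
    for x in xs:
        # bucket index = number of splitters <= x, so buckets are ordered
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for b in buckets:
        if len(b) == len(xs):  # degenerate split (e.g. all-equal keys): fall back
            out.extend(sorted(b))
        else:
            out.extend(sample_sort(b, k, oversample, cutoff))
    return out
```

Oversampling makes the splitters approximate quantiles, which is what keeps bucket sizes balanced for skewed inputs.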
AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors
 in PACT 07
, 2007
"... Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both SIMD instructions and threadlevel parallelism. In this paper, we propose a new parallel sorting algorithm, called AlignedAccess sort (AAsort), for sharedmemory multi proces ..."
Cited by 11 (0 self)
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both SIMD instructions and thread-level parallelism. In this paper, we propose a new parallel sorting algorithm, called Aligned-Access sort (AA-sort), for shared-memory multiprocessors. The AA-sort algorithm takes advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions. We implemented and evaluated AA-sort on PowerPC 970MP and the Cell Broadband Engine. In summary, a sequential version of AA-sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and GPUTeraSort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32M random 32-bit integers. Furthermore, a parallel version of AA-sort demonstrated better scalability with increasing numbers of cores than a parallel version of GPUTeraSort on both platforms.
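As a scalar sketch of AA-sort's overall control flow only: sort blocks sized to fit in cache, then merge the sorted blocks. The SIMD kernels and the aligned-access data layout that the paper is actually about are omitted entirely here.

```python
import heapq

def aa_sort_sketch(xs, block=1024):
    """Two-phase structure: in-cache block sorts, then a multi-way merge.
    AA-sort vectorizes both phases; this sketch mirrors only the phases."""
    runs = [sorted(xs[i:i + block]) for i in range(0, len(xs), block)]
    return list(heapq.merge(*runs))
```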
Faster Lightweight Suffix Array Construction
"... The suffix array is a data structure formed by sorting the suffixes of a string into lexicographical order. It is important for a variety of applications, perhaps most notably pattern matching, pattern discovery and blocksorting data compression. The last decade has seen intensive research toward e ..."
Cited by 10 (3 self)
The suffix array is a data structure formed by sorting the suffixes of a string into lexicographical order. It is important for a variety of applications, perhaps most notably pattern matching, pattern discovery, and block-sorting data compression. The last decade has seen intensive research toward efficient construction of suffix arrays, with algorithms striving not only to be fast, but also "lightweight" (in the sense that they use little working memory). In this paper we describe a new lightweight suffix array construction algorithm. By exploiting several interesting properties of suffixes in combination with cache-conscious programming, we achieve excellent runtimes. Extensive experiments show our approach to be faster than all other known algorithms for the task.
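For readers unfamiliar with the object itself, a textbook construction that simply sorts whole suffixes defines it in two lines. This is shown only to pin down the definition; it is O(n^2 log n) in the worst case and anything but lightweight, whereas the paper's algorithm is fast and memory-frugal.

```python
def suffix_array(s):
    """Indices of the suffixes of s in lexicographical order."""
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For "banana", the sorted suffixes are a, ana, anana, banana, na, nana, giving the index array [5, 3, 1, 0, 4, 2].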
Massively parallel sort-merge joins in main memory multi-core database systems
 PVLDB
, 2012
"... Two emerging hardware trends will dominate the database system technology in the near future: increasing main memorycapacitiesofseveralTBperserverandmassivelyparallel multicore processing. Many algorithmic and control techniques in current database technology were devised for diskbased systems wher ..."
Cited by 7 (1 self)
Two emerging hardware trends will dominate database system technology in the near future: increasing main-memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel multi-core data processing, as it was deemed inferior to hash joins. We devise a suite of new massively parallel sort-merge (MPSM) join algorithms that are based on partial partition-based sorting. Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a hard-to-parallelize final merge step to create one complete sort order. Rather, they work on the independently created runs in parallel. This way our MPSM algorithms are NUMA-affine, as all the sorting is carried out on local memory partitions. An extensive experimental evaluation on a modern 32-core machine with one TB of main memory proves the competitive performance of MPSM on large main-memory databases with billions of objects. It scales (almost) linearly in the number of employed cores and clearly outperforms competing hash join proposals; in particular, it outperforms the "cutting-edge" Vectorwise parallel query engine by a factor of four.
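The "sorted runs without a global merge" idea can be sketched sequentially for distinct join keys; the worker count and plain list slices are illustrative assumptions, whereas in the real system each run lives in one NUMA node's local memory and the run pairs are joined in parallel.

```python
def mpsm_join_sketch(r, s, workers=4):
    """MPSM structure: sort each run independently, then merge-join
    every R-run against every S-run (no global merge step)."""
    def runs(t):
        n = max(1, -(-len(t) // workers))  # ceil-divide into at most `workers` runs
        return [sorted(t[i:i + n]) for i in range(0, len(t), n)]
    out = []
    for rr in runs(r):
        for sr in runs(s):
            i = j = 0
            while i < len(rr) and j < len(sr):  # classic merge-join on one run pair
                if rr[i] < sr[j]:
                    i += 1
                elif rr[i] > sr[j]:
                    j += 1
                else:
                    out.append(rr[i])
                    i += 1
                    j += 1
    return sorted(out)
```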
GPU-Quicksort: A Practical Quicksort Algorithm for Graphics Processors
"... In this paper we describe GPUQuicksort, an efficient Quicksort algorithm suitable for highly parallel multicore graphics processors. Quicksort has previously been considered an inefficient sorting solution for graphics processors, but we show that in CUDA, NVIDIA’s programming platform for general ..."
Cited by 6 (0 self)
In this paper we describe GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multi-core graphics processors. Quicksort has previously been considered an inefficient sorting solution for graphics processors, but we show that in CUDA, NVIDIA's programming platform for general-purpose computations on graphics processors, GPU-Quicksort performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.
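The recursion that GPU-Quicksort parallelizes is ordinary quicksort partitioning; a sequential sketch follows. On the GPU, each partitioning pass is performed cooperatively by many threads writing to the left and right sides via atomic counters, a detail this sketch deliberately omits.

```python
def quicksort_sketch(xs):
    """Three-way partition around a pivot, then recurse on the two sides."""
    if len(xs) <= 1:
        return list(xs)
    pivot = xs[len(xs) // 2]
    left = [x for x in xs if x < pivot]
    mid = [x for x in xs if x == pivot]
    right = [x for x in xs if x > pivot]
    return quicksort_sketch(left) + mid + quicksort_sketch(right)
```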
Optimal splitters for database partitioning with size bounds
 In Proceedings of the International Conference on Database Theory
, 2009
"... Partitioning is an important step in several database algorithms, including sorting, aggregation, and joins. Partitioning is also fundamental for dividing work into equalsized (or balanced) parallel subtasks. In this paper, we aim to find, materialize and maintain a set of partitioning elements (sp ..."
Cited by 5 (1 self)
Partitioning is an important step in several database algorithms, including sorting, aggregation, and joins. Partitioning is also fundamental for dividing work into equal-sized (or balanced) parallel subtasks. In this paper, we aim to find, materialize, and maintain a set of partitioning elements (splitters) for a data set. Unlike traditional partitioning elements, our splitters define both inequality and equality partitions, which allows us to bound the size of the inequality partitions. We provide an algorithm for determining an optimal set of splitters from a sorted data set and show that it has time complexity O(k lg^2 N), where k is the number of splitters requested and N is the size of the data set. We show how the algorithm can be extended to pairs of tables, so that joins can be partitioned into work units that have balanced cost. We demonstrate experimentally (a) that finding the optimal set of splitters can be done efficiently, and (b) that using the precomputed splitters can improve the time to sort a data set by up to 76%, with particular benefits in the presence of a few heavy hitters.
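The baseline notion of splitters can be illustrated by taking evenly spaced quantiles from already-sorted data. This sketch covers only the naive inequality-partition case; the paper's contribution, forming equality partitions to bound partition sizes in the presence of heavy hitters, is exactly what this simple version cannot do.

```python
def pick_splitters(sorted_xs, k):
    """k evenly spaced quantile elements of a sorted list, as splitters."""
    n = len(sorted_xs)
    return [sorted_xs[i * n // (k + 1)] for i in range(1, k + 1)]
```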
Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores
"... Modern business applications and scientific databases call for inherently dynamic data storage environments. Such environments are characterized by two challenging features: (a) they have little idle system time to devote on physical design; and (b) there is little, if any, a priori workload knowled ..."
Cited by 5 (4 self)
Modern business applications and scientific databases call for inherently dynamic data storage environments. Such environments are characterized by two challenging features: (a) they have little idle system time to devote to physical design; and (b) there is little, if any, a priori workload knowledge, while the query and data workload keeps changing dynamically. In such environments, traditional approaches to index building and maintenance cannot apply. Database cracking has been proposed as a solution that allows on-the-fly physical data reorganization as a collateral effect of query processing. Cracking aims to continuously and automatically adapt indexes to the workload at hand, without human intervention. Indexes are built incrementally, adaptively, and on demand. Nevertheless, as we show, existing adaptive indexing methods fail to deliver workload-robustness; they perform much better with random workloads than with others. This frailty derives from the inelasticity with which these approaches interpret each query as a hint on how data should be stored. Current cracking schemes blindly reorganize the data within each query's range, even if that results in successive expensive operations with minimal indexing benefit. In this paper, we introduce stochastic cracking, a significantly more resilient approach to adaptive indexing. Stochastic cracking also uses each query as a hint on how to reorganize data, but not blindly so; it gains resilience and avoids performance bottlenecks by deliberately applying certain arbitrary choices in its decision-making. Thereby, we bring adaptive indexing forward to a mature formulation that confers the workload-robustness previous approaches lacked. Our extensive experimental study verifies that stochastic cracking maintains the desired properties of original database cracking while at the same time performing well with diverse realistic workloads.
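One cracking step, in which a range query physically reorganizes the column it touches, can be sketched as follows. A real cracker index keeps many such piece boundaries and partitions in place rather than via list comprehensions; stochastic cracking then adds extra, deliberately random cracks so that the index does not depend on a benign query sequence.

```python
def crack(column, lo, hi):
    """Reorganize column in place into (< lo, [lo, hi), >= hi) and return
    the slice bounds of the middle piece, so later queries scan less."""
    less = [x for x in column if x < lo]
    mid = [x for x in column if lo <= x < hi]
    more = [x for x in column if x >= hi]
    column[:] = less + mid + more  # physical reorganization as a query side effect
    return len(less), len(less) + len(mid)
```

After the call, answering the same range query is a contiguous scan of `column[a:b]` for the returned bounds `(a, b)`.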
Fast Focus+Context Visualization of Large Scientific Data
, 2004
"... Visualization of highdimensional and timedependent data, resulting from computational simulation, is a very challenging and resourceconsuming task. Here, featurebased visualization approaches, which aim at a usercontrolled reduction of the data shown at one instance of time, proof to be useful. ..."
Cited by 3 (2 self)
Visualization of high-dimensional and time-dependent data, resulting from computational simulation, is a very challenging and resource-consuming task. Here, feature-based visualization approaches, which aim at a user-controlled reduction of the data shown at one instance of time, prove to be useful.
An introspective algorithm for the integer determinant
 In: Proceedings of Transgressive Computing 2006
, 2006
"... ljk.imag.fr/membres/{JeanGuillaume.Dumas;Anna.Urbanska} We present an algorithm for computing the determinant of an integer matrix A. The algorithm is introspective in the sense that it uses several distinct algorithms that run in a concurrent manner. During the course of the algorithm partial resu ..."
Cited by 3 (2 self)
We present an algorithm for computing the determinant of an integer matrix A. The algorithm is introspective in the sense that it uses several distinct algorithms that run in a concurrent manner. During the course of the algorithm, partial results coming from distinct methods can be combined. Then, depending on the current running time of each method, the algorithm can emphasize a particular variant. With the use of very fast modular routines for linear algebra, our implementation is an order of magnitude faster than other existing implementations. Moreover, we prove that the expected complexity of our algorithm is only O(n^3 log^2.5(n‖A‖)) bit operations in the case of random dense matrices, where n is the dimension and ‖A‖ is the largest entry in absolute value of the matrix.
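One ingredient of such modular approaches, computing the determinant modulo several primes and recombining by Chinese remaindering, can be sketched as follows. The prime list is an assumption sized for small test matrices; a real implementation chooses enough primes from a bound on |det A| and, as the paper describes, runs several methods concurrently with early termination.

```python
def det_mod(a, p):
    """Determinant of an integer matrix over Z/p (p prime), by Gaussian
    elimination; a stand-in for the paper's fast modular routines."""
    m = [[x % p for x in row] for row in a]
    n = len(m)
    det = 1
    for c in range(n):
        piv = next((r for r in range(c, n) if m[r][c]), None)
        if piv is None:
            return 0
        if piv != c:
            m[c], m[piv] = m[piv], m[c]
            det = -det  # row swap flips the sign
        det = det * m[c][c] % p
        inv = pow(m[c][c], p - 2, p)  # modular inverse via Fermat's little theorem
        for r in range(c + 1, n):
            f = m[r][c] * inv % p
            for j in range(c, n):
                m[r][j] = (m[r][j] - f * m[c][j]) % p
    return det % p

def det_crt(a, primes=(1000003, 1000033, 1000037)):
    """Recombine the modular determinants by Chinese remaindering."""
    big_m, x = 1, 0
    for p in primes:
        r = det_mod(a, p)
        t = (r - x) * pow(big_m, -1, p) % p  # lift x to be correct mod big_m * p
        x += big_m * t
        big_m *= p
    if x > big_m // 2:  # map to the symmetric range to recover the sign
        x -= big_m
    return x
```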