Results 1–10 of 46
A Comparison of Sorting Algorithms for the Connection Machine CM2
Abstract

Cited by 177 (5 self)
We have implemented three parallel sorting algorithms on the Connection Machine supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64K-processor CM-2, our sample sort implementation can sort 32 × 10^6 64-bit keys in 5.1 seconds, which is over 10 times faster than the CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
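For readers unfamiliar with sample sort, here is a minimal sequential sketch of the idea this abstract highlights: choose splitters by oversampling, partition keys into buckets, and sort each bucket independently (on the CM-2, each bucket would go to a different processor). This is an illustration only, not the paper's implementation; the function and parameter names are ours.

```python
import random

def sample_sort(keys, buckets=8, oversample=4):
    """Sequential sketch of sample sort: pick splitters from a random
    sample, partition keys into buckets, sort each bucket locally."""
    if len(keys) <= buckets:
        return sorted(keys)
    # Oversampling yields better-balanced splitters, flashsort-style.
    sample = sorted(random.sample(keys, min(len(keys), buckets * oversample)))
    splitters = sample[oversample - 1 :: oversample][: buckets - 1]
    bins = [[] for _ in range(buckets)]
    for k in keys:
        # Linear scan over splitters; a real implementation would
        # binary-search instead.
        i = 0
        while i < len(splitters) and k > splitters[i]:
            i += 1
        bins[i].append(k)
    out = []
    for b in bins:
        out.extend(sorted(b))  # each bucket is sorted independently
    return out
```

The result is correct for any choice of splitters; only the bucket balance (and hence parallel speedup) depends on the sampling.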
GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management
, 2006
Abstract

Cited by 114 (10 self)
We present a new algorithm, GPUTeraSort, to sort billion-record wide-key databases using a graphics processing unit (GPU). Our algorithm uses the data and task parallelism on the GPU to perform memory-intensive and compute-intensive tasks while the CPU is used to perform I/O and resource management. We therefore exploit both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface and achieve higher memory bandwidth than purely CPU-based algorithms. GPUTeraSort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk; and (2) read, merge, write. It also pipelines disk transfers and achieves near-peak I/O performance. We have tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with a $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60GB for a penny: the best reported PennySort price/performance. These results suggest that a GPU coprocessor can significantly improve performance on large data processing tasks.
Designing Efficient Sorting Algorithms for Manycore GPUs
, 2009
Abstract

Cited by 68 (4 self)
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
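The scan-based radix sort this abstract refers to can be illustrated with a serial LSD radix sort in which each pass turns a digit histogram into scatter offsets via an exclusive prefix sum (the "parallel scan" primitive). This is a sketch of the general technique, not the paper's CUDA code, and it assumes non-negative integer keys.

```python
def radix_sort(keys, bits=32, radix_bits=4):
    """LSD radix sort sketch: each pass redistributes keys by one digit
    using a counting histogram plus an exclusive scan, the primitive the
    GPU implementations parallelize."""
    buckets = 1 << radix_bits
    mask = buckets - 1
    for shift in range(0, bits, radix_bits):
        counts = [0] * buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive scan converts counts into starting offsets per digit.
        offsets, total = [], 0
        for c in counts:
            offsets.append(total)
            total += c
        out = [0] * len(keys)
        for k in keys:  # stable scatter preserves order within a digit
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Stability of the scatter is essential: it is what lets later passes refine, rather than destroy, the ordering established by earlier passes.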
Fast Parallel Sorting Under LogP: Experience with the CM-5
 IEEE Transactions on Parallel and Distributed Systems
, 1996
Abstract

Cited by 54 (10 self)
In this paper, the LogP model is used to analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort). LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes and input sets. The sensitivity of the algorithms is evaluated by varying the distribution of key values and the rank ordering of the input. The LogP model is shown to be a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. Using an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes for all four algorithms. Communication performance is oblivious to the distribution of the key values, whereas the local sort performance is not. The communication phases in radix and sample sort are sensitive to the ordering of keys, because certain layouts result in contention.
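The LogP parameters lend themselves to simple back-of-the-envelope cost estimates. As one illustrative example (our own, not a formula taken from the paper), the time for a processor to stream k short messages to another is commonly estimated as follows:

```python
def logp_msg_stream(k, L, o, g):
    """Estimated time under LogP for one processor to send k small
    messages to another: the sender pays overhead o per send but can
    inject at most one message per gap g; the last message then needs
    latency L to arrive plus overhead o at the receiver."""
    gap = max(g, o)  # injection rate limited by whichever is larger
    return o + (k - 1) * gap + L + o
```

Estimates like this are how the paper's style of analysis flags contention-prone communication schedules before any code is run.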
Relational joins on graphics processors
, 2007
Abstract

Cited by 43 (6 self)
We present our novel design and implementation of relational join algorithms for new-generation graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient inter-processor communication through fast shared memory, and a programming model for general-purpose computing. Taking advantage of these new features, we design a set of data-parallel primitives such as scan, scatter and split, and use these primitives to implement indexed or non-indexed nested-loop, sort-merge and hash joins. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU and use parallel computation to effectively hide the memory latency. We have implemented our algorithms on a PC with an NVIDIA G80 GPU and an Intel P4 dual-core CPU. Our GPU-based algorithms are able to achieve 2-20 times higher performance than their CPU-based counterparts.
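The scan/scatter/split primitives named in the abstract compose naturally. Here is a sequential sketch of a scan-based stable split, which routes keys with flag 0 before keys with flag 1 using two exclusive scans to compute destinations (our illustration of the primitive style, not the paper's GPU code):

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[i] = sum(xs[:i])."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out

def split(keys, flags):
    """Stable split: move keys with flag 0 before keys with flag 1.
    Destinations come from two exclusive scans, so every scatter is
    independent -- the property a data-parallel version exploits."""
    zeros = [1 - f for f in flags]
    idx0 = exclusive_scan(zeros)   # rank among the 0-flagged keys
    n0 = sum(zeros)                # count of 0-flagged keys
    idx1 = exclusive_scan(flags)   # rank among the 1-flagged keys
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        out[idx1[i] + n0 if flags[i] else idx0[i]] = k
    return out
```

Applying split once per bit of the key yields exactly the radix-sort passes used to build the sort-merge and hash join variants.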
CellSort: High Performance Sorting on the Cell Processor
 In Proc. VLDB
, 2007
Abstract

Cited by 32 (1 self)
In this paper we describe the design and implementation of CellSort, a high-performance distributed sort algorithm for the Cell processor. We design CellSort as a distributed bitonic merge with a data-parallel bitonic sorting kernel. In order to best exploit the architecture of the Cell processor and make use of all available forms of parallelism to achieve good scalability, we structure CellSort as a three-tiered sort. The first tier is a SIMD (single-instruction multiple-data) optimized bitonic sort, which sorts up to 128KB of items that can fit into one SPE's (a coprocessor on Cell) local store. We design a comprehensive SIMDization scheme that employs data parallelism even for the most fine-grained steps of the bitonic sorting kernel. Our results show that the SIMDized bitonic sorting kernel is vastly superior to other alternatives on the SPE and performs up to 1.7 times faster compared to quick sort on a 3.2GHz Intel Xeon. The second tier is an in-core bitonic merge optimized for cross-SPE data transfers via asynchronous DMAs, which sorts as many items as can fit into the cumulative space available on the local stores of the participating SPEs. We design data transfer and synchronization patterns that minimize serial sections of the code by taking advantage of the high aggregate cross-SPE bandwidth available on Cell. Results show that in-core bitonic sort scales well on the Cell processor with an increasing number of SPEs, and performs up to 10 times faster with 16 SPEs compared to parallel quick sort on a dual 3.2GHz Intel Xeon. The third tier is an out-of-core¹ bitonic merge which sorts a large number of items stored in main memory. Results show that, when properly implemented, distributed out-of-core bitonic sort on Cell can significantly outperform the asymptotically (average-case) superior quick sort for a large number of memory-resident items (up to 4 times faster when sorting 0.5GB of data with 16 SPEs, compared to a dual 3.2GHz Intel Xeon).
¹ The term "out-of-core" does not imply a disk-based sort in the context of this paper. However, the relation to external sorting is strong (see Sections 2 and 3 for details).
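Since all three CellSort tiers are built on bitonic sorting and merging, a compact serial rendering of the bitonic network may help: every compare-exchange within a stage is independent, which is exactly what the SIMD kernel and the multi-SPE tiers parallelize. This sketch assumes a power-of-two input length and is not the paper's code.

```python
def bitonic_sort(keys):
    """Serial bitonic sorting network (power-of-two sizes only).
    Outer loop k: size of bitonic sequences being merged.
    Inner loop j: compare-exchange distance within each merge step."""
    n = len(keys)
    assert n and n & (n - 1) == 0, "bitonic network needs a power-of-two size"
    a = list(keys)
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Direction alternates by block so merged runs are bitonic.
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Because the schedule of compare-exchanges is data-independent, the same network maps cleanly onto SIMD lanes, multiple SPEs, or (as in the earlier entries) hypercube wires.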
Accounting for memory bank contention and delay in high-bandwidth multiprocessors
 In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1997
Abstract

Cited by 30 (4 self)
For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant's bulk-synchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
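To make the bank-delay parameter concrete, here is a superstep-cost sketch in the spirit of the (d, x)-BSP: communication time is bounded below both by the busiest processor (at gap g per message) and by the busiest memory bank (at delay d per request). This is our simplification for illustration; the paper's exact cost formulation may differ.

```python
def dxbsp_comm_bound(max_msgs_per_proc, max_reqs_per_bank, g, d):
    """Lower-bound sketch on per-superstep communication time in a
    (d, x)-BSP-like model: whichever bottleneck is worse dominates,
    the busiest processor (g per message) or the busiest bank
    (d per request). Parameter names are ours."""
    return max(g * max_msgs_per_proc, d * max_reqs_per_bank)
```

The point of the extra parameters is visible immediately: an access pattern that concentrates requests on few banks raises `max_reqs_per_bank` and dominates the bound even when processor-side traffic is balanced.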
Fast Parallel Sorting under LogP: from theory to practice
, 1994
Abstract

Cited by 27 (4 self)
The LogP model characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) under LogP. We develop implementations of these algorithms in a parallel extension to C and compare the actual performance on a CM-5 of 32 to 512 processors with that predicted by LogP using parameter values for this machine. Our experience was that the model served as a valuable guide throughout the development of the fast parallel sorts and revealed subtle defects in the implementations. The final observed performance matches closely with the prediction across a broad range of problem and machine sizes. Fast sorting is important in a wide variety of practical applications, is interesting to study from a theoretical viewpoint, and offers a wealth of novel parallel solutions ...
An Experimental Analysis of Parallel Sorting Algorithms
 THEORY OF COMPUTING SYSTEMS
, 1998
Abstract

Cited by 23 (2 self)
We have developed a methodology for predicting the performance of parallel algorithms on real parallel machines. The methodology consists of two steps. First, we characterize a machine by enumerating the primitive operations that it is capable of performing along with the cost of each operation. Next, we analyze an algorithm by making a precise count of the number of times the algorithm performs each type of operation. We have used this methodology to evaluate many of the parallel sorting algorithms proposed in the literature. Of these, we selected the three most promising, Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort, and implemented them on the Connection Machine model CM-2. This paper analyzes the three algorithms in detail and discusses the issues that led us to our particular implementations. On the CM-2 the predicted performance of the algorithms closely matches the observed performance, and hence our methodology can be used to tune the algorithms for optimal performance. Although our programs were designed for the CM-2, our conclusions about the merits of the three algorithms apply to other parallel machines as well.
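The two-step methodology described above reduces to a dot product of per-operation costs (measured on the machine) and operation counts (derived from the algorithm). A trivial sketch, with names and figures of our own invention:

```python
def predict_time(op_counts, op_costs):
    """Predicted running time = sum over primitive operations of
    (how many times the algorithm performs it) x (its measured cost)."""
    return sum(op_counts[op] * op_costs[op] for op in op_counts)

# Hypothetical example: an algorithm doing 100 local compares and
# 40 remote sends on a machine where those cost 2 and 3 time units.
estimate = predict_time({"compare": 100, "send": 40},
                        {"compare": 2, "send": 3})
```

Because the machine characterization and the algorithm analysis are separate inputs, re-targeting the prediction to a new machine only requires re-measuring the cost table.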