Results 1 - 5 of 5
A Comparison of Sorting Algorithms for the Connection Machine CM-2
Abstract

Cited by 173 (6 self)
We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64K-processor CM-2, our sample sort implementation can sort 32 × 10^6 64-bit keys in 5.1 seconds, which is over 10 times faster than the CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
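The sample sort described above draws a random sample of the keys and picks splitters from it so that the resulting buckets are roughly balanced across processors. A minimal sequential sketch of that splitter idea follows; the function name, the oversampling factor, and the bucketing details are illustrative assumptions, not the paper's CM-2 implementation:

```python
import bisect
import random

def sample_sort(keys, num_buckets, oversample=8):
    """Sequential sketch of sample sort: draw a random sample, pick
    evenly spaced splitters from it, bucket the keys, sort each bucket.
    On a parallel machine each bucket would go to one processor."""
    if len(keys) <= num_buckets:
        return sorted(keys)
    sample = sorted(random.sample(keys, min(len(keys), num_buckets * oversample)))
    # Evenly spaced elements of the sorted sample serve as splitters;
    # oversampling makes the bucket sizes concentrate around n / num_buckets.
    splitters = [sample[(i * len(sample)) // num_buckets]
                 for i in range(1, num_buckets)]
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out
```

Because the splitters only partition the keys, the concatenation of the sorted buckets is globally sorted regardless of which sample was drawn; randomness affects only the balance of bucket sizes.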
Implementations of Randomized Sorting on Large Parallel Machines
Abstract

Cited by 28 (1 self)
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routing. In this ...
An Experimental Analysis of Parallel Sorting Algorithms
 THEORY OF COMPUTING SYSTEMS
, 1998
Abstract

Cited by 21 (2 self)
We have developed a methodology for predicting the performance of parallel algorithms on real parallel machines. The methodology consists of two steps. First, we characterize a machine by enumerating the primitive operations that it is capable of performing along with the cost of each operation. Next, we analyze an algorithm by making a precise count of the number of times the algorithm performs each type of operation. We have used this methodology to evaluate many of the parallel sorting algorithms proposed in the literature. Of these, we selected the three most promising, Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort, and implemented them on the Connection Machine model CM-2. This paper analyzes the three algorithms in detail and discusses the issues that led us to our particular implementations. On the CM-2 the predicted performance of the algorithms closely matches the observed performance, and hence our methodology can be used to tune the algorithms for optimal performance. Although our programs were designed for the CM-2, our conclusions about the merits of the three algorithms apply to other parallel machines as well.
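The two-step methodology above (characterize the machine by per-operation costs, then count how often an algorithm performs each operation) amounts to a dot product of operation counts with operation costs. The costs and the radix-sort operation profile below are made-up illustrations of the idea, not the paper's measured values:

```python
# Step 1: characterize the machine as a table of per-operation costs
# (all numbers here are assumed, for illustration only).
machine_costs_us = {
    "local_compare": 0.1,   # microseconds per element
    "local_move": 0.05,
    "route": 5.0,           # general interprocessor communication
    "scan": 2.0,            # parallel prefix primitive
}

def predicted_time_us(op_counts, costs=machine_costs_us):
    """Step 2 meets step 1: dot product of counts with costs."""
    return sum(costs[op] * n for op, n in op_counts.items())

def radix_sort_profile(n, p, key_bits=64, digit_bits=8):
    """A hypothetical operation count for radix-sorting n keys on p
    processors: one counting pass plus one routing pass per digit."""
    passes = key_bits // digit_bits
    return {
        "local_move": passes * (n // p),
        "scan": passes * (1 << digit_bits),
        "route": passes * (n // p),
    }
```

Tuning then reduces to asking which operation dominates the predicted total and restructuring the algorithm to perform fewer of that operation.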
Pipelined Parallel Prefix Computations, and Sorting on a Pipelined Hypercube
 Journal of Parallel and Distributed Computing
, 1993
Abstract

Cited by 11 (7 self)
This paper brings together a number of previously known techniques in order to obtain practical and efficient implementations of the prefix operation for the complete binary tree, hypercube and shuffle-exchange families of networks. For each of these networks, we also provide a "pipelined" scheme for performing k prefix operations in O(k + log p) time on p processors. This implies a similar pipelining result for the "data distribution" operation of Ullman [16]. The data distribution primitive leads to a simplified implementation of the optimal merging algorithm of Varman and Doshi, which runs on a pipelined model of the hypercube [17]. Finally, a pipelined version of the multiway merge sort of Nassimi and Sahni [10], running on the pipelined hypercube model, is described. Given p processors and n ≤ p log p values to be sorted, the running time of the pipelined algorithm is O(log^2 p / log((p log p)/n)). Note that for the interesting case n = p this yields a running time of...
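The prefix operation that this paper pipelines can be illustrated in its log-depth, hypercube-style form, where round d combines each element with the one 2^d positions earlier. This is a generic sketch of the primitive itself, not the paper's pipelined scheme:

```python
def hypercube_scan(values):
    """Log-depth inclusive prefix sum: in round d, element i adds in
    element i - 2**d if it exists. With p elements this takes
    ceil(log2 p) rounds, matching the log p term in O(k + log p)."""
    a = list(values)
    d = 1
    while d < len(a):
        a = [a[i] + (a[i - d] if i >= d else 0) for i in range(len(a))]
        d *= 2
    return a
```

Pipelining k such scans overlaps their rounds so the total time is O(k + log p) rather than k separate O(log p) passes.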
An Evaluation of Sorting as a Supercomputer Benchmark
Abstract

Cited by 2 (0 self)
We propose that sorting be considered an important benchmark for both scientific and commercial applications of supercomputers. The purpose of a supercomputer benchmark is to exercise various system components in an effort to measure important performance characteristics. In the past, numerous benchmarks have been defined in an effort to measure the performance issues associated with numeric computing. These benchmarks stressed arithmetic operations (in particular, floating-point arithmetic). In recent years supercomputer manufacturers have started to look more closely at non-numeric processing tasks, such as databases and information retrieval. The ability to operate on large amounts of non-numeric data will be crucial in the future. This paper discusses the appropriateness of sorting as a benchmark for non-numeric computing tasks. The paper describes previous work in this area and defines a set of architecture-independent sorting benchmarks. Contact: Kurt Thearling phone: (6...