Results 1–10 of 13
A Comparison of Sorting Algorithms for the Connection Machine CM-2
Abstract

Cited by 177 (5 self)
We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64K-processor CM-2, our sample sort implementation can sort 32 × 10^6 64-bit keys in 5.1 seconds, which is over 10 times faster than the CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
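The splitter-based bucketing that distinguishes sample sort can be sketched sequentially. This is an illustrative single-node sketch, not the CM-2 implementation; the function name and the `oversample` parameter are assumptions:

```python
import random

def sample_sort(keys, p, oversample=8):
    """Sequential sketch of sample sort: pick splitters from a random
    sample, then bucket keys so each of the p "processors" can sort
    its bucket independently. p and oversample are illustrative."""
    sample = sorted(random.sample(keys, min(len(keys), p * oversample)))
    # Every oversample-th sample element becomes a splitter (p - 1 of them).
    splitters = sample[oversample::oversample][:p - 1]
    buckets = [[] for _ in range(p)]
    for k in keys:
        # Place k in the first bucket whose splitter is >= k.
        i = 0
        while i < len(splitters) and k > splitters[i]:
            i += 1
        buckets[i].append(k)
    # Each processor sorts its own bucket; concatenation is fully sorted.
    return [k for b in buckets for k in sorted(b)]
```

Oversampling makes the splitters approximate the true quantiles of the data, which is why the bucket sizes, and hence the per-processor work, balance with high probability.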
Parallel Sorting by Regular Sampling
, 1992
Abstract

Cited by 110 (7 self)
A new parallel sorting algorithm suitable for MIMD multiprocessors is presented. The algorithm reduces memory and bus contention, which many parallel sorting algorithms suffer from, by using a regular sampling of the data to ensure good pivot selection. For n data elements to be sorted and p processors, when n ≥ p^3 the algorithm is shown to be asymptotically optimal. In theory, the algorithm is within a factor of two of achieving ideal load balancing. In practice, there is almost perfect partitioning of work. On a variety of shared and distributed memory machines, the algorithm achieves better than half-linear speedups.  1. Introduction. Sorting is one of the most studied problems in computer science because of its theoretical interest and practical importance. With the advent of parallel processing, parallel sorting has become an important area for algorithm research. Although considerable work has been done on the theory of parallel sorting and efficient implementations on SIMD arch...
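The regular-sampling pivot selection at the heart of PSRS can be sketched as a sequential simulation. The pivot-index choice shown is one common variant, not necessarily the paper's exact one, and the function name is ours:

```python
def psrs_pivots(blocks):
    """Sketch of PSRS pivot selection, assuming blocks holds the data
    already split one list per processor. Each processor sorts locally
    (blocks are sorted in place) and contributes p regularly spaced
    samples; the p*p samples are sorted and p-1 pivots chosen from them."""
    p = len(blocks)
    samples = []
    for block in blocks:
        block.sort()
        n = len(block)
        # Regular samples at local positions 0, n/p, 2n/p, ...
        samples.extend(block[(i * n) // p] for i in range(p))
    samples.sort()
    # One common choice: every p-th of the combined samples.
    return [samples[i * p] for i in range(1, p)]
```

Because the samples are taken at regular positions of already-sorted blocks rather than at random, the resulting pivots bound each processor's final partition deterministically, which is what yields the factor-of-two worst-case balance guarantee.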
Communication-Efficient Parallel Sorting
, 1996
Abstract

Cited by 65 (2 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1−1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
On the Versatility of Parallel Sorting by Regular Sampling
 Parallel Computing
, 1993
Abstract

Cited by 54 (14 self)
Parallel sorting algorithms have already been proposed for a variety of multiple instruction streams, multiple data streams (MIMD) architectures. These algorithms often exploit the strengths of the particular machine to achieve high performance. In many cases, however, the existing algorithms cannot achieve comparable performance on other architectures. Parallel Sorting by Regular Sampling (PSRS) is an algorithm that is suitable for a diverse range of MIMD architectures. It has good load balancing properties, modest communication needs and good memory locality of reference. If there are no duplicate keys, PSRS guarantees to balance the work among the processors within a factor of two of optimal in theory, regardless of the data value distribution, and within a few percent of optimal in practice. This paper presents new theoretical and empirical results for PSRS. The theoretical analysis of PSRS is extended to include a lower bound and a tighter upper bound on the work done by a process...
Implementations of Randomized Sorting on Large Parallel Machines
, 1992
Abstract

Cited by 29 (1 self)
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routing. In this paper we present B-Flashsort, a new batched-routing variant of Flashsort designed to sort N > P values using P processors connected in a d-dimensional mesh and using constant space in addition to the input and output. The key advantage of the Flashsort approach over Samplesort is a decrease in memory requirements, by avoiding the broadcast of the splitter set to all processors. The practical advantage of B-Flashsort over Flashsort is that it replaces pipelined splitter-directed routing with a set of synchronous local communications and bounds recursion, while still being demonstrably efficient. The performance of B-Flashsort and Samplesort is compared using a parameterized analytic model in the style of [BLM+91] to show that on a d-dimensional toroidal mesh B-Flashsort improves on Samplesort when (N/P) < P/(c1 log P + c2 d P^(1/d) + c3), for machine-dependent parameters c1, c2, and c3. Empirical confirmation of the analytical model is obtained through implementations on a MasPar MP-1 of Samplesort and two B-Flashsort variants.
An Experimental Analysis of Parallel Sorting Algorithms
 THEORY OF COMPUTING SYSTEMS
, 1998
Abstract

Cited by 23 (2 self)
We have developed a methodology for predicting the performance of parallel algorithms on real parallel machines. The methodology consists of two steps. First, we characterize a machine by enumerating the primitive operations that it is capable of performing along with the cost of each operation. Next, we analyze an algorithm by making a precise count of the number of times the algorithm performs each type of operation. We have used this methodology to evaluate many of the parallel sorting algorithms proposed in the literature. Of these, we selected the three most promising, Batcher’s bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant’s flashsort, and implemented them on the Connection Machine model CM-2. This paper analyzes the three algorithms in detail and discusses the issues that led us to our particular implementations. On the CM-2 the predicted performance of the algorithms closely matches the observed performance, and hence our methodology can be used to tune the algorithms for optimal performance. Although our programs were designed for the CM-2, our conclusions about the merits of the three algorithms apply to other parallel machines as well.
A Comparison Based Parallel Sorting Algorithm
 IN PROC. INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING
, 1993
Abstract

Cited by 7 (3 self)
We present a fast comparison-based parallel sorting algorithm that can handle arbitrary key types. Data movement is the major portion of sorting time for most algorithms in the literature. Our algorithm is parameterized so that it can be tuned to control data movement time, especially for large data sets. Parallel histograms are used to partition the key set exactly. The algorithm is architecture independent, and has been implemented in the CHARM portable parallel programming system, allowing it to be efficiently run on virtually any MIMD computer. Performance results for sorting different data sets are presented.
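As a rough illustration of exact partitioning via histograms, here is a much-simplified sequential sketch for small integer keys. The paper's algorithm handles arbitrary comparable keys and computes histograms in parallel; the function name and the value-range assumption are ours:

```python
def histogram_splitters(keys, p, lo, hi):
    """Sketch of exact partitioning via a histogram, assuming integer
    keys in [lo, hi). Counting how many keys fall at or below each value
    lets us pick p-1 splitters that divide the key set exactly, rather
    than probabilistically as sampling does."""
    counts = [0] * (hi - lo)
    for k in keys:
        counts[k - lo] += 1
    target = len(keys) / p          # ideal elements per processor
    splitters, running = [], 0
    for v, c in enumerate(counts):
        running += c
        # Emit a splitter each time the running count crosses the
        # next ideal partition boundary.
        if running >= target * (len(splitters) + 1) and len(splitters) < p - 1:
            splitters.append(lo + v)
    return splitters
```

Because the splitters come from exact counts rather than a sample, every partition hits its target size as closely as duplicate keys permit, which is the sense in which the key set is partitioned "exactly."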
An Evaluation of Sorting as a Supercomputer Benchmark
, 1993
Abstract

Cited by 2 (0 self)
We propose that sorting be considered an important benchmark for both scientific and commercial applications of supercomputers. The purpose of a supercomputer benchmark is to exercise various system components in an effort to measure important performance characteristics. In the past, numerous benchmarks have been defined in an effort to measure the performance issues associated with numeric computing. These benchmarks stressed arithmetic operations (in particular, floating-point arithmetic). In recent years supercomputer manufacturers have started to look more closely at non-numeric processing tasks, such as databases and information retrieval. The ability to operate on large amounts of non-numeric data will be crucial in the future. This paper discusses the appropriateness of sorting as a benchmark for non-numeric computing tasks. The paper describes previous work in this area and defines a set of architecture-independent sorting benchmarks.
Integer Sorting Algorithms for Coarse-Grained Parallel Machines
 In Proc. of Int'l Conference on High Performance Computing HiPC'97
, 1997
Abstract

Cited by 2 (1 self)
Integer sorting is a subclass of the sorting problem where the elements have integer values and the largest element is polynomially bounded in the number of elements to be sorted. It is useful for applications in which the maximum value of the elements to be sorted is bounded. In this paper, we present a new distributed radix-sort algorithm for integer sorting. The structure of our algorithm is similar to radix sort except that it typically requires fewer communication phases. We present experimental results for our algorithm on two distributed memory multiprocessors, the Intel Paragon and the Thinking Machines CM-5. These results are compared with two other well-known practical parallel sorting algorithms based on radix sort and sample sort. The experimental results show that the distributed radix sort is competitive with the other two algorithms. 1 Introduction. Sorting has been a widely studied problem for parallel machines [8, 9, 12, 14, 13, 10, 1, 4, 7, 6, 11]. Int...
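The phase structure being reduced can be seen in a sequential LSD radix sort, where each digit pass corresponds to one counting/communication phase in a distributed setting. This is an illustrative sketch, not the paper's algorithm; the bit-width parameters are assumptions:

```python
def radix_sort(keys, bits=16, radix_bits=4):
    """Sequential sketch of LSD radix sort over bits-bit non-negative
    integers, processed radix_bits at a time. A distributed version runs
    one counting/redistribution phase per digit, so a larger radix
    (fewer digits) means fewer communication phases."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            # Stable placement by the current digit preserves the order
            # established by earlier (lower-order) passes.
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys
```

The trade-off the abstract alludes to: doubling `radix_bits` halves the number of passes but squares the bucket count, so per-phase counting work and memory grow as communication shrinks.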
Parallel Sorting by Overpartitioning
 in Proc. ACM Symp. on Parallel Algorithms and Architectures
, 1994
Abstract
A new approach to parallel sorting called Parallel Sorting by Overpartitioning (PSOP) is presented. The approach limits the communication cost by moving each element between processors at most once, and leads to good load balancing with high probability. The PSOP framework can be applied to both comparison and non-comparison sorts. Implementations on the KSR-1 and Hector shared memory multiprocessors show that PSOP achieves nearly linear speedup and outperforms alternative approaches. An analytical model for PSOP has been developed that predicts the performance to within 10% accuracy. Key Words: Parallel Sorting, Load Balance, Overpartitioning, Oversampling, COMA and NUMA Multiprocessors. 1 Introduction. Sorting is an important symbolic application [4], and parallel algorithms for sorting have been studied intensively ever since Batcher [5] proposed the bitonic sorting network. Although the theoretical community has devoted considerable effort to the design of parallel sorting algorithms ...
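Overpartitioning can be sketched sequentially: create many more buckets than processors via sampled splitters, then assign whole buckets to processors, largest bucket first. This is a simplified sketch; the oversampling rate, the greedy assignment policy, and the names are illustrative, not the paper's exact scheme:

```python
import bisect
import random

def overpartition(keys, p, k=4):
    """Sketch of overpartitioning: split keys into k*p buckets using
    splitters drawn from a sorted sample, then deal whole buckets out
    to p processors greedily. With k*p buckets instead of p, the largest
    bucket is a smaller fraction of the total, so the load balances with
    high probability. Returns one list of buckets per processor."""
    nb = k * p
    sample = sorted(random.sample(keys, min(len(keys), 8 * nb)))
    step = max(1, len(sample) // nb)
    splitters = sample[step::step][:nb - 1]
    buckets = [[] for _ in range(nb)]
    for key in keys:
        buckets[bisect.bisect_left(splitters, key)].append(key)
    loads = [[] for _ in range(p)]
    # Greedy largest-first assignment: give each bucket, whole, to the
    # currently least-loaded processor. Each element moves at most once.
    for b in sorted(buckets, key=len, reverse=True):
        min(loads, key=lambda l: sum(len(x) for x in l)).append(b)
    return loads
```

Assigning whole buckets is what keeps each element's movement to a single transfer, in contrast to schemes that rebalance by shuffling elements again after partitioning.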