Results 1 - 10
of
32
A Comparison of Sorting Algorithms for the Connection Machine CM-2
"... We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in t ..."
Abstract
-
Cited by 166 (6 self)
- Add to MetaCart
We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64K-processor CM-2, our sample sort implementation can sort 32 10 6 64-bit keys in 5.1 seconds, which is over 10 times faster than the CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
Direct Bulk-Synchronous Parallel Algorithms
- JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1992
"... We describe a methodology for constructing parallel algorithms that are transportable among parallel computers having different numbers of processors, different bandwidths of interprocessor communication and different periodicity of global synchronisation. We do this for the bulk-synchronous paralle ..."
Abstract
-
Cited by 157 (26 self)
- Add to MetaCart
We describe a methodology for constructing parallel algorithms that are transportable among parallel computers having different numbers of processors, different bandwidths of interprocessor communication and different periodicity of global synchronisation. We do this for the bulk-synchronous parallel (BSP) model, which abstracts the characteristics of a parallel machine into three numerical parameters p, g, and L, corresponding to processors, bandwidth, and periodicity respectively. The model differentiates memory that is local to a processor from that which is not, but, for the sake of universality, does not differentiate network proximity. The advantages of this model in supporting shared memory or PRAM style programming have been treated elsewhere. Here we emphasise the viability of an alternative direct style of programming where, for the sake of efficiency the programmer retains control of memory allocation. We show that optimality to within a multiplicative factor close to one ca...
A Limit Theorem for "Quicksort"
- Applications/Theoretical Informatics and Applications
, 1999
"... Let X n be the number of comparisons needed by the sorting algorithm Quicksort to sort a list of n numbers into their natural ordering. We show that (X n \Gamma E(X n ))=n converges weakly to some random variable Y. The distribution of Y is characterized as the fixed point of some contraction. It sa ..."
Abstract
-
Cited by 82 (2 self)
- Add to MetaCart
Let X n be the number of comparisons needed by the sorting algorithm Quicksort to sort a list of n numbers into their natural ordering. We show that (X n \Gamma E(X n ))=n converges weakly to some random variable Y. The distribution of Y is characterized as the fixed point of some contraction. It satisfies a recursive equation, which is used to provide recursive relations for the moments. The random variable Y has exponential tails. Therefore the probability that Quicksort performs badly, e.g. that X n is larger than 2E(X n ) converges polynomially fast of every order to zero. R'esum'e Soit X n le nombre de comparaisons utilis'ees par la proc'edure Quicksort pour trier une liste de nombres distincts. Nous d'emontrons que (X n \Gamma E(X n ))=n converge faiblement vers une certaine variable al'eatoire Y. La distribution de Y est le point fixe d'une contraction et peut etre calcul'ee num'eriquement par it'eration. Keywords: sorting algorithm quicksort, fixed point, asymptotic distribut...
BSPlib: The BSP Programming Library
, 1998
"... BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming ..."
Abstract
-
Cited by 80 (5 self)
- Add to MetaCart
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access using put or get operations, and bulk synchronous message passing. Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics. Key words: Bulk synchronous parallel, parallel communications library, one-sided communication. 1 Introduction Since the earliest days of computing it has been clear that, sooner or lat...
Parallel sorting on a shared-nothing architecture using probabilistic splitting
, 1991
"... We consider the problem of external sorting in a shared-nothing multiprocessor. A critical step in the algorithms we consider is to determine the range of sort keys to be handled by each processor. We consider two techniques for determining these ranges of sort keys: exact splitting, using a paralle ..."
Abstract
-
Cited by 76 (1 self)
- Add to MetaCart
We consider the problem of external sorting in a shared-nothing multiprocessor. A critical step in the algorithms we consider is to determine the range of sort keys to be handled by each processor. We consider two techniques for determining these ranges of sort keys: exact splitting, using a parallel version of the algorithm proposed by Iyer, Ricard, and Varman; and probabilistic splitting, which uses sampling to estimate quantiles. We present analytic results showing that probabilistic splitting performs better than exact splitting. Finally, we present experimental results from an implementation of sorting via probabilistic splitting in the Gamma parallel database machine.
Communication-Efficient Parallel Sorting
, 1996
"... We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sort ..."
Abstract
-
Cited by 60 (2 self)
- Add to MetaCart
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O( n log n p ) and a number of communication rounds that is O( log n log(h+1) ) for h = \Theta(n=p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p n 1\Gamma1=c for a constant c 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n=h) of an arbitrary number of processors in a BSP computer requires\Omega\Gammaqui n= log(h...
A Survey of Adaptive Sorting Algorithms
, 1992
"... Introduction and Survey; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems --- Sorting and Searching; E.5 [Data]: Files --- Sorting/searching; G.3 [Mathematics of Computing]: Probability and Statistics --- Probabilistic algorithms; E.2 [Data Storage Represe ..."
Abstract
-
Cited by 55 (3 self)
- Add to MetaCart
Introduction and Survey; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems --- Sorting and Searching; E.5 [Data]: Files --- Sorting/searching; G.3 [Mathematics of Computing]: Probability and Statistics --- Probabilistic algorithms; E.2 [Data Storage Representation]: Composite structures, linked representations. General Terms: Algorithms, Theory. Additional Key Words and Phrases: Adaptive sorting algorithms, Comparison trees, Measures of disorder, Nearly sorted sequences, Randomized algorithms. A Survey of Adaptive Sorting Algorithms 2 CONTENTS INTRODUCTION I.1 Optimal adaptivity I.2 Measures of disorder I.3 Organization of the paper 1.WORST-CASE ADAPTIVE (INTERNAL) SORTING ALGORITHMS 1.1 Generic Sort 1.2 Cook--Kim division 1.3 Partition Sort 1.4 Exponential Search 1.5 Adaptive Merging 2.EXPECTED-CASE ADAPTIV
Designing Efficient Sorting Algorithms for Manycore GPUs
, 2009
"... We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix ..."
Abstract
-
Cited by 34 (2 self)
- Add to MetaCart
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23 % faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed onchip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be wellsuited for other manycore processors.
Implementations of Randomized Sorting on Large Parallel Machines
"... Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routin ..."
Abstract
-
Cited by 27 (1 self)
- Add to MetaCart
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routing. In this
Sorting and Selection on Interconnection Networks
- DIMACS Series in Discrete Mathematics and Theoretical Computer Science
, 1995
"... ABSTRACT. In this paper we identify techniques that havebeen employed in the design of sorting and selection algorithms for various interconnection networks. We consider both randomized and deterministic techniques. Interconnection Networks of interest include the mesh, the mesh with xed and recon g ..."
Abstract
-
Cited by 20 (14 self)
- Add to MetaCart
ABSTRACT. In this paper we identify techniques that havebeen employed in the design of sorting and selection algorithms for various interconnection networks. We consider both randomized and deterministic techniques. Interconnection Networks of interest include the mesh, the mesh with xed and recon gurable buses, the hypercube family, and the star graph. For the sake of comparisons, we also list PRAM algorithms. 1

