Results 1  10
of
59
A Comparison of Sorting Algorithms for the Connection Machine CM2
"... We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms pro ..."
Abstract

Cited by 188 (7 self)
 Add to MetaCart
(Show Context)
We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64Kprocessor CM2, our sample sort implementation can sort 32 10 6 64bit keys in 5.1 seconds, which is over 10 times faster than the CM2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
Direct BulkSynchronous Parallel Algorithms
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1992
"... We describe a methodology for constructing parallel algorithms that are transportable among parallel computers having different numbers of processors, different bandwidths of interprocessor communication and different periodicity of global synchronisation. We do this for the bulksynchronous paralle ..."
Abstract

Cited by 173 (27 self)
 Add to MetaCart
We describe a methodology for constructing parallel algorithms that are transportable among parallel computers having different numbers of processors, different bandwidths of interprocessor communication and different periodicity of global synchronisation. We do this for the bulksynchronous parallel (BSP) model, which abstracts the characteristics of a parallel machine into three numerical parameters p, g, and L, corresponding to processors, bandwidth, and periodicity respectively. The model differentiates memory that is local to a processor from that which is not, but, for the sake of universality, does not differentiate network proximity. The advantages of this model in supporting shared memory or PRAM style programming have been treated elsewhere. Here we emphasise the viability of an alternative direct style of programming where, for the sake of efficiency the programmer retains control of memory allocation. We show that optimality to within a multiplicative factor close to one ca...
Designing Efficient Sorting Algorithms for Manycore GPUs
, 2009
"... We describe the design of highperformance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparisonbased sort reported in the literature. Our radix ..."
Abstract

Cited by 120 (5 self)
 Add to MetaCart
We describe the design of highperformance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparisonbased sort reported in the literature. Our radix sort is up to 4 times faster than the graphicsbased GPUSort and greater than 2 times faster than other CUDAbased radix sorts. It is also 23 % faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial finegrained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the highspeed onchip shared memory provided by NVIDIA’s GPU architecture and efficient dataparallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be wellsuited for other manycore processors.
A Limit Theorem for "Quicksort"
 Applications/Theoretical Informatics and Applications
, 1999
"... Let X n be the number of comparisons needed by the sorting algorithm Quicksort to sort a list of n numbers into their natural ordering. We show that (X n \Gamma E(X n ))=n converges weakly to some random variable Y. The distribution of Y is characterized as the fixed point of some contraction. It sa ..."
Abstract

Cited by 103 (2 self)
 Add to MetaCart
Let X n be the number of comparisons needed by the sorting algorithm Quicksort to sort a list of n numbers into their natural ordering. We show that (X n \Gamma E(X n ))=n converges weakly to some random variable Y. The distribution of Y is characterized as the fixed point of some contraction. It satisfies a recursive equation, which is used to provide recursive relations for the moments. The random variable Y has exponential tails. Therefore the probability that Quicksort performs badly, e.g. that X n is larger than 2E(X n ) converges polynomially fast of every order to zero. R'esum'e Soit X n le nombre de comparaisons utilis'ees par la proc'edure Quicksort pour trier une liste de nombres distincts. Nous d'emontrons que (X n \Gamma E(X n ))=n converge faiblement vers une certaine variable al'eatoire Y. La distribution de Y est le point fixe d'une contraction et peut etre calcul'ee num'eriquement par it'eration. Keywords: sorting algorithm quicksort, fixed point, asymptotic distribut...
Heuristic Search Viewed as Path Finding in a Graph
 ARTIFICIAL INTELLIGENCE
, 1970
"... This paper presents a particular model of heuristic search as a pathfinding problem in a directed graph. A class of graphsearching procedures is described which uses a heuristic function to guide search. Heuristic functions are estimates of the number of edges that remain to be traversed in reachi ..."
Abstract

Cited by 98 (0 self)
 Add to MetaCart
This paper presents a particular model of heuristic search as a pathfinding problem in a directed graph. A class of graphsearching procedures is described which uses a heuristic function to guide search. Heuristic functions are estimates of the number of edges that remain to be traversed in reaching a goal node. A number of theoretical results for this model, and the intuition for these results, are presented. They relate the e])~ciency of search to the accuracy of the heuristic function. The results also explore efficiency as a consequence of the reliance or weight placed on the heuristics used.
BSPlib: The BSP Programming Library
, 1998
"... BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming ..."
Abstract

Cited by 93 (8 self)
 Add to MetaCart
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access using put or get operations, and bulk synchronous message passing. Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics.
Parallel sorting on a sharednothing architecture using probabilistic splitting
, 1991
"... We consider the problem of external sorting in a sharednothing multiprocessor. A critical step in the algorithms we consider is to determine the range of sort keys to be handled by each processor. We consider two techniques for determining these ranges of sort keys: exact splitting, using a paralle ..."
Abstract

Cited by 87 (1 self)
 Add to MetaCart
(Show Context)
We consider the problem of external sorting in a sharednothing multiprocessor. A critical step in the algorithms we consider is to determine the range of sort keys to be handled by each processor. We consider two techniques for determining these ranges of sort keys: exact splitting, using a parallel version of the algorithm proposed by Iyer, Ricard, and Varman; and probabilistic splitting, which uses sampling to estimate quantiles. We present analytic results showing that probabilistic splitting performs better than exact splitting. Finally, we present experimental results from an implementation of sorting via probabilistic splitting in the Gamma parallel database machine.
CommunicationEfficient Parallel Sorting
, 1996
"... We study the problem of sorting n numbers on a pprocessor bulksynchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processortoprocessor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sort ..."
Abstract

Cited by 75 (4 self)
 Add to MetaCart
(Show Context)
We study the problem of sorting n numbers on a pprocessor bulksynchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processortoprocessor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O( n log n p ) and a number of communication rounds that is O( log n log(h+1) ) for h = \Theta(n=p). The internal computation bound is optimal for any comparisonbased sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p n 1\Gamma1=c for a constant c 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n=h) of an arbitrary number of processors in a BSP computer requires\Omega\Gammaqui n= log(h...
A Survey of Adaptive Sorting Algorithms
, 1992
"... Introduction and Survey; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems  Sorting and Searching; E.5 [Data]: Files  Sorting/searching; G.3 [Mathematics of Computing]: Probability and Statistics  Probabilistic algorithms; E.2 [Data Storage Represe ..."
Abstract

Cited by 74 (3 self)
 Add to MetaCart
Introduction and Survey; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems  Sorting and Searching; E.5 [Data]: Files  Sorting/searching; G.3 [Mathematics of Computing]: Probability and Statistics  Probabilistic algorithms; E.2 [Data Storage Representation]: Composite structures, linked representations. General Terms: Algorithms, Theory. Additional Key Words and Phrases: Adaptive sorting algorithms, Comparison trees, Measures of disorder, Nearly sorted sequences, Randomized algorithms. A Survey of Adaptive Sorting Algorithms 2 CONTENTS INTRODUCTION I.1 Optimal adaptivity I.2 Measures of disorder I.3 Organization of the paper 1.WORSTCASE ADAPTIVE (INTERNAL) SORTING ALGORITHMS 1.1 Generic Sort 1.2 CookKim division 1.3 Partition Sort 1.4 Exponential Search 1.5 Adaptive Merging 2.EXPECTEDCASE ADAPTIV
Implementations of Randomized Sorting on Large Parallel Machines
, 1992
"... Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitterdirected routin ..."
Abstract

Cited by 29 (1 self)
 Add to MetaCart
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitterdirected routing. In this paper we present BFlashsort, a new batchedrouting variant of Flashsort designed to sort N>P values using P processors connected in a ddimensional mesh and using constant space in addition to the input and output. The key advantage of the Flashsort approach over Samplesort is a decrease in memory requirements, by avoiding the broadcast of the splitter set to all processors. The practical advantage of BFlashsort over Flashsort is that it replaces pipelined splitterdirected routing with a set of synchronous local communications and bounds recursion, while still being demonstrably efficient. The performance of BFlashsort and Samplesort is compared using a parameterized analytic model in the style of [BLM+91] to show that on a ddimensional toroidal mesh BFlashsort improves on Samplesort when (N/P)ּ<ּP/(c 1log P +c 2dP 1/d +c 3), for machinedependent parameters c 1, c 2, and c 3. Empirical confirmation of the analytical model is obtained through implementations on a MasPar MP1 of Samplesort and two BFlashsort variants.