A Comparison of Sorting Algorithms for the Connection Machine CM-2
We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that the sample sort algorithm, which is a theoretically efficient "randomized" algorithm, is the fastest of the three algorithms on large data sets. On a 64K-processor CM-2, our sample sort implementation can sort $32 \times 10^6$ 64-bit keys in 5.1 seconds, which is over 10 times faster than the CM-2 library sort. Our implementation of radix sort, although not as fast on large data sets, is deterministic, much simpler to code, stable, faster with small keys, and faster on small data sets (few elements per processor). Our implementation of bitonic sort, which is pipelined to use all the hypercube wires simultaneously, is the least efficient of the three on large data sets, but is the most efficient on small data sets, and is considerably more space efficient. This paper analyzes the three algorithms in detail and discusses many practical issues that led us to the particular implementations.
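As an illustration of the sample sort idea named in this abstract, the following is a minimal sequential sketch in Python, not the authors' CM-2 implementation. It simulates p processors as p buckets: draw a random sample, pick p - 1 splitters from it, route each key to its bucket, and sort each bucket locally. The function name `sample_sort` and the `oversample` and `seed` parameters are assumptions made for this example.

```python
import bisect
import random


def sample_sort(keys, p, oversample=8, seed=0):
    """Sequential simulation of sample sort with p buckets ("processors").

    `oversample` (sample elements per bucket) and `seed` are illustrative
    parameters, not values taken from the paper.
    """
    keys = list(keys)
    if p <= 1 or len(keys) <= p:
        return sorted(keys)

    rng = random.Random(seed)

    # 1. Draw a random sample of the keys and sort it.
    sample_size = min(len(keys), p * oversample)
    sample = sorted(rng.sample(keys, sample_size))

    # 2. Choose p - 1 evenly spaced splitters from the sorted sample.
    step = len(sample) / p
    splitters = [sample[int(i * step)] for i in range(1, p)]

    # 3. Route every key to the bucket whose key range contains it.
    buckets = [[] for _ in range(p)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)

    # 4. Sort each bucket locally; buckets are already ordered relative
    #    to one another, so concatenation yields the sorted sequence.
    result = []
    for bucket in buckets:
        bucket.sort()
        result.extend(bucket)
    return result
```

On a real parallel machine, step 3 is the communication phase and step 4 runs independently on each processor; oversampling is what keeps the bucket sizes balanced with high probability.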
Models of Computation: Exploring the Power of Computing
Theoretical computer science treats any computational subject for which a good model can be created. Research on formal models of computation was initiated in the 1930s and 1940s by Turing, Post, Kleene, Church, and others. In the 1950s and 1960s programming languages, language translators, and operating systems were under development and therefore became both the subject and basis for a great deal of theoretical work. The power of computers of this period was limited by slow processors and small amounts of memory, and thus theories (models, algorithms, and analysis) were developed to explore the efficient use of computers as well as the inherent complexity of problems. The former subject is known today as algorithms and data structures, the latter as computational complexity. The focus of theoretical computer scientists in the 1960s on languages is reflected in the first textbook on the subject, Formal Languages and Their Relation to Automata by John Hopcroft and Jeffrey Ullman. This influential book led to the creation of many language-centered theoretical computer science courses; many introductory theory courses today continue to reflect the content of this book and the interests of theoreticians of the 1960s and early 1970s. Although
Special Purpose Parallel Computing
Lectures on Parallel Computation, 1993
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
Scalable Parallel Computational Geometry for Coarse Grained Multicomputers
International Journal on Computational Geometry, 1994
We study scalable parallel computational geometry algorithms for the coarse grained multicomputer model: p processors solving a problem on n data items, where each processor has $O(n/p) \gg O(1)$ local memory and all processors are connected via some arbitrary interconnection network (e.g. mesh, hypercube, fat tree). We present $O(T_{sequential}/p + T_s(n, p))$ time scalable parallel algorithms for several computational geometry problems. $T_s(n, p)$ refers to the time of a global sort operation. Our results are independent of the multicomputer's interconnection network. Their time complexities become optimal when $T_{sequential}/p$ dominates $T_s(n, p)$ or when $T_s(n, p)$ is optimal. This is the case for several standard architectures, including meshes and hypercubes, and a wide range of ratios $n/p$ that include many of the currently available machine configurations. Our methods also have some important practical advantages: For interprocessor communication, they use only a small fixed numb...
Communication-Efficient Parallel Sorting
1996
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is $O((n \log n)/p)$ and a number of communication rounds that is $O(\log n / \log(h+1))$ for $h = \Theta(n/p)$. The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when $p \le n^{1-1/c}$ for a constant $c \ge 1$. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
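The constant-round claim in this abstract follows from a short calculation. The derivation below is a sketch that assumes $h = n/p$ exactly (the abstract states only $h = \Theta(n/p)$, which changes the constant but not the conclusion) and $n > 1$:

```latex
\begin{align*}
p \le n^{1-1/c}
  &\;\Longrightarrow\; h = \frac{n}{p} \ge n^{1/c} \\
  &\;\Longrightarrow\; \log(h+1) > \log h \ge \frac{1}{c}\,\log n \\
  &\;\Longrightarrow\; \frac{\log n}{\log(h+1)} < c .
\end{align*}
```

So whenever the number of processors is polynomially smaller than the input size, the round bound $O(\log n / \log(h+1))$ collapses to $O(c) = O(1)$.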
Fast Algorithms for Bit-Serial Routing on a Hypercube
1991
In this paper, we describe an O(log N)-bit-step randomized algorithm for bit-serial message routing on a hypercube. The result is asymptotically optimal, and improves upon the best previously known algorithms by a logarithmic factor. The result also solves the problem of on-line circuit switching in an O(1)-dilated hypercube (i.e., the problem of establishing edge-disjoint paths between the nodes of the dilated hypercube for any one-to-one mapping). Our algorithm is adaptive and we show that this is necessary to achieve the logarithmic speedup. We generalize the Borodin-Hopcroft lower bound on oblivious routing by proving that any randomized oblivious algorithm on a polylogarithmic degree network requires at least $\Omega(\log^2 N / \log\log N)$ bit steps with high probability for almost all permutations. 1 Introduction Substantial effort has been devoted to the study of store-and-forward packet routing algorithms for hypercubic networks. The fastest algorithms are randomized, and c...
Packet Routing In Fixed-Connection Networks: A Survey
1998
We survey routing problems on fixed-connection networks. We consider many aspects of the routing problem and provide known theoretical results for various communication models. We focus on (partial) permutation, k-relation routing, routing to random destinations, dynamic routing, isotonic routing, fault-tolerant routing, and related sorting results. We also provide a list of unsolved problems and numerous references.
Derandomizing Algorithms for Routing and Sorting on Meshes
Proc. 5th Symp. on Discrete Algorithms, 1994
We describe a new technique that can be used to derandomize a number of randomized algorithms for routing and sorting on meshes. We demonstrate the power of this technique by deriving improved deterministic algorithms for a variety of routing and sorting problems. Our main results are an optimal algorithm for k-k routing on multidimensional meshes, a permutation routing algorithm with running time 2n+o(n) and queue size 5, and an optimal algorithm for 1-1 sorting. 1 Introduction One of the main problems in the simulation of idealistic parallel computers by realistic ones is the problem of message routing through the sparse network of links connecting a set of processing units (PUs) among each other. In this paper, we consider the case of the $n \times n$ mesh, in which $n^2$ PUs are connected by a regular two-dimensional grid of bidirectional communication links. There may also be additional wrap-around connections between the two PUs at opposite ends of each row and each column of t...
Implementations of Randomized Sorting on Large Parallel Machines
1992
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routing. In this paper we present B-Flashsort, a new batched-routing variant of Flashsort designed to sort N > P values using P processors connected in a d-dimensional mesh and using constant space in addition to the input and output. The key advantage of the Flashsort approach over Samplesort is a decrease in memory requirements, by avoiding the broadcast of the splitter set to all processors. The practical advantage of B-Flashsort over Flashsort is that it replaces pipelined splitter-directed routing with a set of synchronous local communications and bounds recursion, while still being demonstrably efficient. The performance of B-Flashsort and Samplesort is compared using a parameterized analytic model in the style of [BLM+91] to show that on a d-dimensional toroidal mesh B-Flashsort improves on Samplesort when $(N/P) < P/(c_1 \log P + c_2 d P^{1/d} + c_3)$, for machine-dependent parameters $c_1$, $c_2$, and $c_3$. Empirical confirmation of the analytical model is obtained through implementations on a MasPar MP-1 of Samplesort and two B-Flashsort variants.
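To make the crossover condition in this abstract concrete, the sketch below evaluates the inequality (N/P) < P/(c1 log P + c2 d P^(1/d) + c3) for given inputs. The function name `bflashsort_favored` is hypothetical, the logarithm is assumed to be base 2 (the abstract does not state the base), and the constants c1, c2, c3 used in the usage note are placeholders rather than measured MasPar MP-1 values.

```python
import math


def bflashsort_favored(n, p, d, c1, c2, c3):
    """Return True when (N/P) < P / (c1*log P + c2*d*P**(1/d) + c3).

    n: number of keys N; p: number of processors P; d: mesh dimension.
    c1, c2, c3: machine-dependent parameters (caller-supplied; no real
    values are assumed here). Log base 2 is an assumption of this sketch.
    """
    denominator = c1 * math.log2(p) + c2 * d * p ** (1.0 / d) + c3
    return (n / p) < p / denominator
```

For example, with placeholder constants c1 = c2 = c3 = 1 on a two-dimensional mesh of 2^14 processors, the threshold on N/P is 16384 / 271, roughly 60 keys per processor: the condition holds for small per-processor loads and fails for large ones.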
Sorting and Selection on Interconnection Networks
DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1995
In this paper we identify techniques that have been employed in the design of sorting and selection algorithms for various interconnection networks. We consider both randomized and deterministic techniques. Interconnection networks of interest include the mesh, the mesh with fixed and reconfigurable buses, the hypercube family, and the star graph. For the sake of comparisons, we also list PRAM algorithms.