An Experimental Analysis of Parallel Sorting Algorithms
 THEORY OF COMPUTING SYSTEMS
, 1998
Cited by 21 (2 self)
We have developed a methodology for predicting the performance of parallel algorithms on real parallel machines. The methodology consists of two steps. First, we characterize a machine by enumerating the primitive operations that it is capable of performing along with the cost of each operation. Next, we analyze an algorithm by making a precise count of the number of times the algorithm performs each type of operation. We have used this methodology to evaluate many of the parallel sorting algorithms proposed in the literature. Of these, we selected the three most promising (Batcher’s bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant’s flashsort) and implemented them on the Connection Machine model CM-2. This paper analyzes the three algorithms in detail and discusses the issues that led us to our particular implementations. On the CM-2 the predicted performance of the algorithms closely matches the observed performance, and hence our methodology can be used to tune the algorithms for optimal performance. Although our programs were designed for the CM-2, our conclusions about the merits of the three algorithms apply to other parallel machines as well.
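The two-step methodology in this abstract (cost each primitive operation, then count how often an algorithm performs it) can be sketched as a weighted sum. This is only an illustration: the operation names, unit costs, and operation counts below are hypothetical placeholders, not measurements or figures from the paper.

```python
# Step 1: characterize the machine by a cost per primitive operation.
# Hypothetical unit costs (microseconds); NOT CM-2 measurements.
PRIMITIVE_COST = {"arithmetic": 1.0, "send": 130.0, "scan": 20.0}


def predict_time(op_counts, costs=PRIMITIVE_COST):
    """Predicted running time: sum over operation types of count * unit cost."""
    return sum(count * costs[op] for op, count in op_counts.items())


# Step 2: count how often each algorithm performs each operation.
# Hypothetical counts for two candidate algorithms on one input size.
algo_a = {"arithmetic": 5000, "send": 40, "scan": 10}
algo_b = {"arithmetic": 2000, "send": 90, "scan": 0}

predicted = {"A": predict_time(algo_a), "B": predict_time(algo_b)}
best = min(predicted, key=predicted.get)  # candidate with the lowest prediction
```

With these made-up numbers, candidate B does less arithmetic but pays more for communication, which is the kind of trade-off such a model is meant to expose before any tuning.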
Communication Efficient Data Structures on the BSP model with Applications
 IN PROCEEDINGS OF EURO-PAR'96
, 1996
Cited by 18 (8 self)
The implementation of data structures on distributed memory models such as the Bulk-Synchronous Parallel (BSP) model, rather than shared memory ones such as the Parallel Random Access Machine (PRAM), offers a serious challenge. In this work we undertake the architecture-independent study of the computation and communication requirements of searching ordered h-level graphs, which include many of the standard data structures. We propose multiway search as a general tool for the design, analysis and implementation of BSP algorithms. This technique allows elegant high-level design and analysis of algorithms, using data structures similar to those of sequential models. Applications to computational geometry and sorting are also presented. In particular, our new randomized sorting algorithm improves upon previously known BSP randomized sorting algorithms in the amount of parallel slackness required to achieve optimality. Moreover, our methods are within a 1 + o(1) multiplicative factor of the ...
A Randomized Sorting Algorithm on the BSP model
 IN PROCEEDINGS OF IPPS
, 1997
Cited by 15 (5 self)
We present a new randomized sorting algorithm on the Bulk-Synchronous Parallel (BSP) model. The algorithm improves upon the parallel slack of previous algorithms to achieve optimality. Tighter probabilistic bounds are also established. It uses sample sorting and utilizes recently introduced search algorithms for a class of data structures on the BSP model. Moreover, our methods are within a 1+o(1) multiplicative factor of the respective sequential methods in terms of speedup for a wide range of the BSP parameters.
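For readers unfamiliar with sample sorting, the general idea can be sketched as below. This is a generic, sequential simulation in which each bucket stands in for one processor. It is not the BSP algorithm of this paper, and the function name, the `oversample` parameter, and the fixed seed are assumptions made for illustration.

```python
import random
from bisect import bisect_right


def sample_sort(keys, p, oversample=4, seed=0):
    """Sequential simulation of generic sample sort with p buckets."""
    if p < 2 or len(keys) < p * oversample:
        return sorted(keys)  # too little data to sample; sort directly
    rng = random.Random(seed)
    # Draw a random sample and sort it.
    sample = sorted(rng.sample(keys, p * oversample))
    # Every oversample-th sample element becomes a splitter (p - 1 in total).
    splitters = [sample[i * oversample] for i in range(1, p)]
    # Route each key to the bucket whose splitter range contains it.
    buckets = [[] for _ in range(p)]
    for k in keys:
        buckets[bisect_right(splitters, k)].append(k)
    # Each bucket is sorted locally; concatenating buckets in order
    # yields the globally sorted sequence.
    return [k for bucket in buckets for k in sorted(bucket)]
```

In a real parallel setting the routing step is the communication phase and the local sorts run concurrently. Oversampling is the usual way to keep bucket sizes balanced with high probability.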
Sorting Large Data Sets on a Massively Parallel System
 IN PROCEEDINGS OF THE 6TH IEEE SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING (SPDP)
, 1994
Cited by 11 (2 self)
This paper presents a performance study for many of today's popular parallel sorting algorithms. It is the first to present a comparative study on a large-scale MIMD system. The machine, a Parsytec GCel, contains 1024 processors connected as a two-dimensional grid. To justify the experimental results, we develop a theoretical model to predict the performance in terms of communication and computation times. We get a very close relation between the experiments and the theoretical model as long as the edge congestion caused by the algorithms is predicted precisely. We compare: Bitonicsort, Shearsort, Gridsort, Samplesort, and Radixsort. Experiments were performed using random instances according to a well-known benchmark problem. Results show that for the machine we used, Bitonicsort performs best for smaller numbers of keys per processor (< 2048) and Samplesort outperforms all other methods for larger instances.
Efficient Deterministic Sorting on the BSP Model
 PARALLEL PROCESSING LETTERS
, 1996
Cited by 8 (7 self)
We present a new algorithm for deterministic sorting on the Bulk-Synchronous Parallel (BSP) model of computation. We sort n general keys using a partitioning scheme that achieves the requirements of efficiency (1-optimality) and insensitivity against data skew. Although we employ sampling in order to realize efficiency, we can give a precise worst-case estimation of the maximum imbalance which might occur. The algorithm is 1-optimal for a wide range of the BSP parameters in the sense that its speedup on p processors is asymptotically (1 − o(1))p. Experimental results for the algorithm are also presented.
Portability of performance with the BSPLib communications library
 In Programming Models for Massively Parallel Computers (MPPM'97)
, 1997
Cited by 7 (3 self)
The BSP cost model makes a new level of power available for designing parallel algorithms. First, it models the actual behaviour of today's parallel computers, and so can be used to choose appropriate algorithms without completely implementing them. Second, it becomes possible to characterise the range of architecture performance over which a particular algorithm is the best choice. This provides the foundations for developing software that is both portable at the source code level, and in its expectation of performance. We illustrate this by comparing three possible implementations of broadcast, and show that a two-phase broadcast algorithm outperforms other techniques whenever the size of the data is large relative to the cost of synchronisation, and that broadcasting using trees is never a good technique (despite its continued popularity). We carry out a similar analysis for samplesort, and show that samplesort cannot perform well on networks of workstations unless the network bandw...
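The broadcast comparison described in this abstract can be illustrated with the standard BSP superstep cost h*g + l, where g is the per-word communication cost and l the synchronisation cost. The three formulas below are common textbook approximations chosen for this sketch, not expressions quoted from the paper, and the function name and parameter values are hypothetical.

```python
from math import ceil, log2


def broadcast_costs(n, p, g, l):
    """Approximate BSP cost of broadcasting n words to p processes (p >= 2)."""
    # Direct: the source sends all n words to each of the other p - 1 processes.
    direct = n * (p - 1) * g + l
    # Binary tree: ceil(log2 p) supersteps, each forwarding the full n words.
    tree = ceil(log2(p)) * (n * g + l)
    # Two-phase: scatter n/p-word pieces, then every process sends its piece
    # to all others; each phase moves n * (p - 1) / p words per process.
    two_phase = 2 * (n * (p - 1) / p) * g + 2 * l
    return {"direct": direct, "tree": tree, "two_phase": two_phase}


# Hypothetical machine parameters, for illustration only.
large = broadcast_costs(n=100_000, p=16, g=4.0, l=1000.0)
small = broadcast_costs(n=1, p=16, g=4.0, l=1000.0)
```

Under these simplified formulas the two-phase scheme is cheapest once n*g dominates l, the direct scheme is cheapest for very small messages, and the tree is never strictly cheaper than the better of the other two, which is consistent with the claim in the abstract.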
Sorting on a Massively Parallel System Using a Library of Basic Primitives: Modeling and Experimental Results
 In Proc. of the 3rd European Conference in Parallel Processing (Euro-Par)
, 1996
Cited by 6 (1 self)
We present a comparative study of implementations of the following sorting algorithms on the Parsytec SC320 reconfigurable, asynchronous, massively parallel MIMD machine: Bitonic Sort, Odd-Even Merge Sort, Odd-Even Merge Sort with guarded split&merge, Periodic Balanced Sort, Columnsort, and two variants of Samplesort. The experiments are performed on 2- up to 5-dimensional wrapped butterfly networks with 8 up to 160 processors. We make use of library functions that provide primitives for global variables and synchronization, and we show that it is possible to implement efficient and portable programs easily. We assume that the time for accessing a global variable is linear in the parameters s, d, and c, where s is the size of the variable, d the distance between the accessing processor and the processor holding the variable, and c the contention, i.e., the number of processors accessing the variable simultaneously. Therefore, to predict the performance, we model the runtime of this ac...
Efficient Oblivious Parallel Sorting on the MasPar MP-1
 In Proc. 30th IEEE HICSS
, 1997
Cited by 6 (2 self)
We address the problem of sorting a large number N of keys on a MasPar MP-1 parallel SIMD machine of moderate size P where the processing elements (PEs) are interconnected as a toroidal mesh and have 16 KB local storage each. We present a comparative study of implementations of the following deterministic oblivious sorting methods: Bitonic Sort, Odd-Even Merge Sort, and FastSort. We successfully use the guarded split&merge operation introduced by Rüb. The experiments and investigations in a simple, parameterized, analytical model show that, with this operation, from a certain ratio N/P upwards both Odd-Even Merge Sort and FastSort become faster on average than the fastest implementation to date, the sophisticated implementation of Bitonic Sort by Prins. Though it is not as efficient as Odd-Even Merge Sort, FastSort is to our knowledge the first method specially tailored to the mesh architecture that can be, when implemented, competitive on average with a mesh adaptation of Bitonic Sort for large ...
A cost model for communication on a symmetric multiprocessor
, 1998
Cited by 5 (2 self)
In this paper we conduct an in-depth study of the communication costs of programs when run on a typical Symmetric MultiProcessor, the SGI Power Challenge, characterized by powerful off-the-shelf microprocessors communicating through a shared memory via a shared-bus interconnect. Our study is based on an extensive set of experiments designed to assess the relative impact of a number of parameters on the cost of shared memory accesses. We provide evidence that interaction with the memory hierarchy affects communication in such a substantial way that none of the models previously considered in the literature can guarantee a reasonable level of accuracy since they do not take this interaction into account. We then determine two prediction functions that are very accurate predictors of best and worst performance with respect to the memory hierarchy. These functions provide a prediction interval that can be employed to obtain lower and upper bounds on the actual communication cost of an application, and to evaluate the degree of locality of the memory access patterns involved.
A General-Purpose Model for Heterogeneous Computation
, 2000
Cited by 5 (2 self)
Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications. Such environments consist of a diverse set of machines and offer considerably more computational power at a lower cost than a parallel computer. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. For example, faster machines should possess more data items than their slower counterparts and communication should be minimized over slow network links. Current parallel applications are not designed with such heterogeneity in mind. Thus, a new approach is necessary for designing efficient heterogeneous parallel programs. We propose