Results 11 - 20 of 58
An Experimental Analysis of Parallel Sorting Algorithms
 THEORY OF COMPUTING SYSTEMS
, 1998
"... We have developed a methodology for predicting the performance of parallel algorithms on real parallel machines. The methodology consists of two steps. First, we characterize a machine by enumerating the primitive operations that it is capable of performing along with the cost of each operation. Ne ..."
Abstract

Cited by 21 (2 self)
We have developed a methodology for predicting the performance of parallel algorithms on real parallel machines. The methodology consists of two steps. First, we characterize a machine by enumerating the primitive operations that it is capable of performing, along with the cost of each operation. Next, we analyze an algorithm by making a precise count of the number of times the algorithm performs each type of operation. We have used this methodology to evaluate many of the parallel sorting algorithms proposed in the literature. Of these, we selected the three most promising: Batcher’s bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant’s flashsort, and implemented them on the Connection Machine model CM-2. This paper analyzes the three algorithms in detail and discusses the issues that led us to our particular implementations. On the CM-2 the predicted performance of the algorithms closely matches the observed performance, and hence our methodology can be used to tune the algorithms for optimal performance. Although our programs were designed for the CM-2, our conclusions about the merits of the three algorithms apply to other parallel machines as well.
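Batcher's bitonic sort, one of the three algorithms compared in the abstract, has a fixed compare-exchange structure that is easy to sketch. The following sequential Python sketch (an illustration, not the paper's implementation) runs the network's stages one after another; on a machine like the CM-2 each inner loop of compare-exchanges would execute in parallel.

```python
def bitonic_sort(a):
    """Sequential simulation of Batcher's bitonic sorting network.
    Input length must be a power of two."""
    n = len(a)
    assert n & (n - 1) == 0 and n > 0
    a = list(a)
    k = 2
    while k <= n:                       # size of bitonic sequences being merged
        j = k // 2
        while j >= 1:                   # compare-exchange distance in this stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # blocks of size k alternate sort direction
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Counting how many compare-exchange operations these loops perform is exactly the kind of primitive-operation count the methodology above relies on.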
Speeding up External Mergesort
 IEEE Transactions on Knowledge and Data Engineering
"... External mergesort is normally implemented so that each run is stored contiguously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Int ..."
Abstract

Cited by 19 (0 self)
External mergesort is normally implemented so that each run is stored contiguously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Interleaved layout places blocks from different runs in consecutive disk addresses. This is done in the hope that interleaving will reduce seek overhead during merging. The new reading strategy precomputes the order in which data blocks are to be read according to where they are located on disk and when they are needed for merging. Extra buffer space makes it possible to read blocks in an order that reduces seek overhead, instead of reading them exactly in the order they are needed for merging. A detailed simulation model was used to compare the two layout strategies and three reading strategies. The effects of using multiple work disks were also investigated. We found that, in most cases, inte...
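The precomputed reading strategy rests on one observation: once the runs are written, the order in which their blocks will be consumed during the merge is fully determined by each block's smallest key. A small Python sketch of that precomputation (the names and the in-memory data layout are illustrative, not the paper's):

```python
import heapq

def consumption_order(runs):
    """runs: list of runs; each run is a list of blocks; each block is a
    sorted list of keys. Returns (run_id, block_id) pairs in the order the
    merge will need them: ascending order of each block's first key, which
    can be computed in advance and used to plan a seek-reducing read order."""
    heap = [(blocks[0][0], rid, 0) for rid, blocks in enumerate(runs)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, rid, bid = heapq.heappop(heap)
        order.append((rid, bid))
        if bid + 1 < len(runs[rid]):
            nxt = runs[rid][bid + 1]
            heapq.heappush(heap, (nxt[0], rid, bid + 1))
    return order
```

Given this schedule and some extra buffer space, blocks can then be fetched in an order chosen by disk location rather than strictly in consumption order.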
Communication Efficient Data Structures on the BSP model with Applications
 IN PROCEEDINGS OF EUROPAR'96
, 1996
"... The implementation of data structures on distributed memory models such as the BulkSynchronous Parallel (BSP) model, rather than shared memory ones such as the Parallel Random Access Machine (PRAM), offers a serious challenge. In this work we undertake the architecture independent study of the comp ..."
Abstract

Cited by 18 (8 self)
The implementation of data structures on distributed memory models such as the Bulk-Synchronous Parallel (BSP) model, rather than shared memory ones such as the Parallel Random Access Machine (PRAM), offers a serious challenge. In this work we undertake the architecture-independent study of the computation and communication requirements of searching ordered h-level graphs, which include many of the standard data structures. We propose multiway search as a general tool for the design, analysis and implementation of BSP algorithms. This technique allows elegant high-level design and analysis of algorithms, using data structures similar to those of sequential models. Applications to computational geometry and sorting are also presented. In particular, our new randomized sorting algorithm improves upon previously known BSP randomized sorting algorithms in the amount of parallel slackness required to achieve optimality. Moreover, our methods are within a 1 + o(1) multiplicative factor of the ...
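The multiway search idea can be illustrated on the simplest ordered structure, a sorted array: a query narrows its candidate range by a factor of about k per round by probing k - 1 positions at once, so a search finishes in roughly log_k n rounds (supersteps) instead of the log_2 n of binary search. A hedged single-query sketch, with all names my own:

```python
def multiway_search(sorted_keys, q, k=4):
    """Locate the insertion point of q in sorted_keys (like bisect_left),
    probing k-1 evenly spaced positions per round. Returns (position,
    rounds); rounds grows as log_k(n) rather than log_2(n)."""
    lo, hi = 0, len(sorted_keys)        # insertion point lies in [lo, hi]
    rounds = 0
    while hi > lo:
        rounds += 1
        new_lo, new_hi = lo, hi
        for j in range(1, k):
            p = lo + (hi - lo) * j // k     # probe index, lo <= p < hi
            if sorted_keys[p] < q:
                new_lo = max(new_lo, p + 1)  # answer is right of p
            else:
                new_hi = min(new_hi, p)      # answer is at or left of p
        lo, hi = new_lo, new_hi
    return lo, rounds
```

In a BSP setting the k probes of a round would be issued as one batch of remote reads, which is what makes the reduced round count pay off in supersteps.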
An Implementation of a General-Purpose Parallel Sorting Algorithm
, 1993
"... A parallel sorting algorithm is presented for general purpose internal sorting on MIMD machines. ..."
Abstract

Cited by 18 (8 self)
A parallel sorting algorithm is presented for general-purpose internal sorting on MIMD machines.
A Randomized Sorting Algorithm on the BSP model
 IN PROCEEDINGS OF IPPS
, 1997
"... We present a new randomized sorting algorithm on the BulkSynchronousParallel (BSP) model. The algorithm improves upon the parallel slack of previous algorithms to achieve optimality. Tighter probabilistic bounds are also established. It uses sample sorting and utilizes recently introduced search al ..."
Abstract

Cited by 15 (5 self)
We present a new randomized sorting algorithm on the Bulk-Synchronous Parallel (BSP) model. The algorithm improves upon the parallel slack of previous algorithms to achieve optimality. Tighter probabilistic bounds are also established. It uses sample sorting and utilizes recently introduced search algorithms for a class of data structures on the BSP model. Moreover, our methods are within a 1 + o(1) multiplicative factor of the respective sequential methods in terms of speedup for a wide range of the BSP parameters.
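Sample sorting, which the algorithm above builds on, is easy to sketch sequentially: draw a random sample, pick p - 1 splitters from it, partition the keys into p buckets, and sort each bucket (on a real BSP machine each bucket would be sorted by a separate processor). The oversampling factor s below is an illustrative parameter, not the paper's:

```python
import random

def sample_sort(keys, p=4, s=8, seed=0):
    """Sequential sketch of sample sort: p buckets, p*s random samples,
    every s-th sample used as a splitter."""
    rng = random.Random(seed)
    sample = sorted(rng.choices(keys, k=p * s))
    splitters = [sample[i * s] for i in range(1, p)]
    buckets = [[] for _ in range(p)]
    for x in keys:
        # bucket index = number of splitters not exceeding x
        # (a linear scan here; binary search in a real implementation)
        b = sum(x >= sp for sp in splitters)
        buckets[b].append(x)
    # buckets are separated by the splitters, so sorted buckets concatenate
    # into a globally sorted sequence
    return [x for bucket in buckets for x in sorted(bucket)]
```

The "parallel slack" question the abstract addresses is how evenly random splitters balance the bucket sizes, since the largest bucket determines the parallel running time.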
Logarithmic time cost optimal parallel sorting is not yet fast in practice
 Dept. of Computer Science, Brown University
, 1990
"... When looking for new and faster parallel sorting algorithms for use in massively parallel systems it is tempting to investigate promising alternatives from the large body of research doneon parallel sorting in the eld of theoretical computer science. Such \theoretical " algorithms are mainly describ ..."
Abstract

Cited by 14 (3 self)
When looking for new and faster parallel sorting algorithms for use in massively parallel systems, it is tempting to investigate promising alternatives from the large body of research done on parallel sorting in the field of theoretical computer science. Such "theoretical" algorithms are mainly described for the PRAM (Parallel Random Access Machine) model of computation [13, 26]. This paper shows how this kind of investigation can be done in a simple but versatile environment for programming and measuring PRAM algorithms [18, 19]. The practical value of Cole's Parallel Merge Sort algorithm [10, 11] has been investigated by comparing it with Batcher's bitonic sorting [5]. The O(log n) time consumption of Cole's algorithm implies that it must be faster than bitonic sorting, which is O(log^2 n) time, if n is large enough. However, we have found that bitonic sorting is faster as long as n is less than 1.2 × 10^21, i.e. more than 1 Giga Tera items! Consequently, Cole's logarithmic-time algorithm is not fast in practice.
1 Introduction and Motivation
The work reported in this paper is an attempt to lessen the gap between theory and practice within the field of parallel computing. Within theoretical computer science, parallel algorithms are mainly compared by using asymptotic analysis (O-notation). This paper gives an example of how the analysis of implemented algorithms on finite problems provides new and more practically oriented results than those traditionally obtained by asymptotic analysis. Parallel Complexity Theory: A Rich Source for
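The reported crossover point can be sanity-checked with back-of-the-envelope arithmetic: if Cole's algorithm takes roughly c * log2(n) steps and bitonic sorting roughly c' * log2(n)**2, Cole wins only once log2(n) exceeds c / c'. A constant-factor ratio of about 70 reproduces a crossover near the abstract's 1.2 × 10^21. The constants below are hypothetical, chosen to illustrate the mechanism; they are not the paper's measurements:

```python
def crossover(c_cole, c_bitonic):
    """Smallest n beyond which c_cole * log2(n) < c_bitonic * log2(n)**2,
    i.e. where Cole's merge sort overtakes bitonic sorting.
    Dividing both sides by log2(n) gives log2(n) > c_cole / c_bitonic."""
    return 2.0 ** (c_cole / c_bitonic)

# A hypothetical constant-factor ratio of 70 puts the crossover at
# 2**70, about 1.18e21, matching the abstract's order of magnitude.
```

This is why a logarithmic-factor asymptotic advantage can be irrelevant for every physically realizable input size.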
Communicable Memory and Lazy Barriers for Bulk Synchronous Parallelism in BSPk
, 1996
"... Communication and synchronization stand as the dual bottlenecks in the performance of parallel systems, and especially those that attempt to alleviate the programming burden by incurring overhead in these two domains. We formulate the notions of communicable memory and lazy barriers to help achi ..."
Abstract

Cited by 13 (1 self)
Communication and synchronization stand as the dual bottlenecks in the performance of parallel systems, and especially those that attempt to alleviate the programming burden by incurring overhead in these two domains. We formulate the notions of communicable memory and lazy barriers to help achieve efficient communication and synchronization. These concepts are developed in the context of BSPk, a toolkit library for programming networks of workstations (and other distributed memory architectures in general) based on the Bulk Synchronous Parallel (BSP) model. BSPk emphasizes efficiency in communication by minimizing local memory-to-memory copying, and in barrier synchronization by not forcing a process to wait unless it needs remote data. Both the message passing (MP) and distributed shared memory (DSM) programming styles are supported in BSPk. MP helps processes efficiently exchange short-lived unnamed data values, when the identity of either the sender or receiver is known to the other party. By contrast, DSM supports communication between processes that may be mutually anonymous, so long as they can agree on variable names in which to store shared temporary or long-lived data.
Tight Comparison Bounds On The Complexity Of Parallel Sorting
, 1987
"... The problem of sorting n elements using p processors in a parallel comparison model is considered. Lower and upper bounds which imply that for p ³ n, the time complexity of this problem is Q( log(1 + p / n) logn ___________ ) are presented. This complements [AKS83] in settling the problem since ..."
Abstract

Cited by 12 (3 self)
The problem of sorting n elements using p processors in a parallel comparison model is considered. Lower and upper bounds are presented which imply that for p ≥ n, the time complexity of this problem is Θ(log n / log(1 + p/n)). This complements [AKS83] in settling the problem, since the AKS sorting network established that for p ≤ n the time complexity is Θ((n log n) / p). To prove the lower bounds we show that to achieve parallel time k (for k ≤ log n), Ω(n^(1 + 1/k)) processors are needed.
1. Introduction
Apparently, there is no problem in Computer Science which has received more attention than sorting. [Kn73], for instance, found that existing computers devote approximately a quarter of their time to sorting. The advent of parallel computers stimulated intensive research on sorting with respect to various models of parallel computation. Extensive lists of references which record this activity are given in [Ak85], [BHe86] and [Th83]. Most of the fastest serial and paral...
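Plugging numbers into the Θ(log n / log(1 + p/n)) bound (constants suppressed, so only the shape is meaningful) shows the two regimes the abstract describes: with p = n the time is Θ(log n), while p = n^(1 + 1/k) processors bring it down to Θ(k):

```python
import math

def comparison_time(n, p):
    """Shape of the comparison-model sorting bound for p >= n processors,
    with the Theta constant taken as 1."""
    return math.log2(n) / math.log2(1 + p / n)

# n = 2**20, p = n        -> about log2(n) = 20 comparison rounds
# n = 2**20, p = n**1.5   -> about k = 2 rounds (here 1/k = 0.5)
```

The steep processor cost of each saved round (squaring the processor count to halve the time, roughly) is what the Ω(n^(1 + 1/k)) lower bound formalizes.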
Many-to-Many Routing on Trees via Matchings
, 1996
"... In this paper we present an extensive study of manytomany routing on trees under the matching routing model. Our study includes online and offline algorithms. We present an asymptotically optimal online algorithm which routes k packets to their destination within d(k \Gamma 1) + d \Delta dist r ..."
Abstract

Cited by 10 (4 self)
In this paper we present an extensive study of many-to-many routing on trees under the matching routing model. Our study includes online and offline algorithms. We present an asymptotically optimal online algorithm which routes k packets to their destinations within d(k - 1) + d · dist routing steps, where d is the degree of the tree T on which the routing takes place and dist is the maximum distance any packet has to travel. We also present an offline algorithm that solves the same problem within 2(k - 1) + dist steps. The analysis of our algorithms is based on the establishment of a close relationship between the matching and the hot-potato routing models, which allows us to apply tools that were previously used exclusively in the analysis of hot-potato routing.
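On the simplest tree, a path, routing under the matching model reduces to odd-even transposition: alternate between the two edge matchings of the path, and swap the packets on a matched edge whenever the swap moves them toward their destinations. A toy sketch for permutation routing on a path (this special case is my illustration of the model, not the paper's tree algorithm):

```python
def route_on_path(dest):
    """dest[i] = destination node of the packet initially at node i
    (a permutation of 0..n-1). In each step one matching of the path is
    active and matched neighbors may exchange packets. Returns the number
    of matching steps until every packet is at its destination."""
    cur = list(dest)
    n = len(cur)
    steps = 0
    while cur != sorted(cur):
        for start in (0, 1):                # even matching, then odd matching
            for i in range(start, n - 1, 2):
                # swapping an out-of-order pair moves both packets
                # toward their destinations
                if cur[i] > cur[i + 1]:
                    cur[i], cur[i + 1] = cur[i + 1], cur[i]
            steps += 1
    return steps
```

Odd-even transposition delivers any permutation on an n-node path within n matching steps, consistent with the flavor of the bounds above (here d = 2).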
How to Sort N items using a sorting network of fixed I/O size
, 1999
"... Sorting networks of a fixed I/O size p have been used, thus far, for sorting a set of p elements. Somewhat surprisingly, the important problem of using such a sorting network for sorting arbitrarily large data sets has not been addressed in the literature. Our main contribution is to propose a si ..."
Abstract

Cited by 7 (1 self)
Sorting networks of a fixed I/O size p have been used, thus far, for sorting a set of p elements. Somewhat surprisingly, the important problem of using such a sorting network for sorting arbitrarily large data sets has not been addressed in the literature. Our main contribution is to propose a simple sorting architecture whose main feature is the pipelined use of a sorting network of fixed I/O size p to sort an arbitrarily large data set of N elements. A noteworthy feature of our design is that no extra data memory space is required, other than what is used for storing the input. As it turns out, our architecture is feasible for VLSI implementation and its time performance is virtually independent of the cost and depth of the underlying sorting network. Specifically, we show that by using our design N elements can be sorted in ... time without memory access conflicts. Finally, we show how to use an AT optimal sorting network of fixed I/O size p to construct a similar architecture that sorts N elements in ...
Key Words: computer architecture, sorting, parallel processing, pipelined processing, sorting networks.