Results 1 - 8 of 8
Parallel Sorting by Regular Sampling, 1992
Abstract

Cited by 101 (7 self)
A new parallel sorting algorithm suitable for MIMD multiprocessors is presented. The algorithm reduces memory and bus contention, which many parallel sorting algorithms suffer from, by using a regular sampling of the data to ensure good pivot selection. For n data elements to be sorted and p processors, when n ≥ p^3 the algorithm is shown to be asymptotically optimal. In theory, the algorithm is within a factor of two of achieving ideal load balancing. In practice, there is almost perfect partitioning of work. On a variety of shared and distributed memory machines, the algorithm achieves better than half-linear speedups.  1. Introduction Sorting is one of the most studied problems in computer science because of its theoretical interest and practical importance. With the advent of parallel processing, parallel sorting has become an important area for algorithm research. Although considerable work has been done on the theory of parallel sorting and efficient implementations on SIMD arch...
Parallel sorting on a shared-nothing architecture using probabilistic splitting, 1991
Abstract

Cited by 81 (1 self)
We consider the problem of external sorting in a shared-nothing multiprocessor. A critical step in the algorithms we consider is to determine the range of sort keys to be handled by each processor. We consider two techniques for determining these ranges of sort keys: exact splitting, using a parallel version of the algorithm proposed by Iyer, Ricard, and Varman; and probabilistic splitting, which uses sampling to estimate quantiles. We present analytic results showing that probabilistic splitting performs better than exact splitting. Finally, we present experimental results from an implementation of sorting via probabilistic splitting in the Gamma parallel database machine.
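The probabilistic-splitting idea — estimate the p-1 splitting keys from a random sample rather than computing exact quantiles over all keys — can be sketched as follows; the function name and default sample size are illustrative assumptions, not the paper's:

```python
import random

def probabilistic_splitters(keys, p, sample_size=1000):
    """Estimate the p-1 splitting keys for p processors by sorting a
    random sample and reading off its p-quantiles, instead of exactly
    splitting the full key set."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    return [sample[(i + 1) * len(sample) // p] for i in range(p - 1)]
```

Each key is then routed to the processor whose range contains it (e.g. with a binary search over the splitters); the sample size trades splitter accuracy against the cost of sampling.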
On the Versatility of Parallel Sorting by Regular Sampling
Parallel Computing, 1993
Abstract

Cited by 48 (14 self)
Parallel sorting algorithms have already been proposed for a variety of multiple instruction streams, multiple data streams (MIMD) architectures. These algorithms often exploit the strengths of the particular machine to achieve high performance. In many cases, however, the existing algorithms cannot achieve comparable performance on other architectures. Parallel Sorting by Regular Sampling (PSRS) is an algorithm that is suitable for a diverse range of MIMD architectures. It has good load balancing properties, modest communication needs and good memory locality of reference. If there are no duplicate keys, PSRS guarantees to balance the work among the processors within a factor of two of optimal in theory, regardless of the data value distribution, and within a few percent of optimal in practice. This paper presents new theoretical and empirical results for PSRS. The theoretical analysis of PSRS is extended to include a lower bound and a tighter upper bound on the work done by a process...
Empirical Analysis of Overheads in Cluster Environments
Concurrency: Practice and Experience, 1995
Abstract

Cited by 23 (5 self)
In concurrent computing environments that are based on heterogeneous processing elements interconnected by general-purpose networks, several classes of overheads contribute to lowered performance. In an attempt to gain a deeper insight into the exact nature of these overheads, and to develop strategies to alleviate them, we have conducted empirical studies of selected applications representing different classes of concurrent programs. These analyses have identified load imbalance, the parallelism model adopted, communication delay and throughput, and system factors as the primary factors affecting performance in cluster environments. Based on the degree to which these factors affect specific classes of applications, we propose a combination of model selection criteria, partitioning strategies, and software system heuristics to reduce overheads and enhance performance in network based environments. We demonstrate that agenda parallelism and load balancing strategies contribu...
Speeding up External Mergesort
 IEEE Transactions on Knowledge and Data Engineering
Abstract

Cited by 20 (0 self)
External mergesort is normally implemented so that each run is stored contiguously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Interleaved layout places blocks from different runs in consecutive disk addresses. This is done in the hope that interleaving will reduce seek overhead during merging. The new reading strategy precomputes the order in which data blocks are to be read according to where they are located on disk and when they are needed for merging. Extra buffer space makes it possible to read blocks in an order that reduces seek overhead, instead of reading them exactly in the order they are needed for merging. A detailed simulation model was used to compare the two layout strategies and three reading strategies. The effects of using multiple work disks were also investigated. We found that, in most cases, inte...
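The precomputed reading order can be derived from the runs alone: a block of a run is first needed when its smallest key becomes the smallest unread key of any run. A sketch of that computation (the data layout and names are illustrative, not the paper's):

```python
import heapq

def consumption_order(runs):
    """Return the order [(run, block), ...] in which blocks are needed
    by a k-way merge, judged by each block's first (smallest) key.
    runs: list of runs, each run a list of sorted blocks of keys.
    Given B extra buffers, the actual reads can then be issued in
    disk-address order within each window of B upcoming blocks,
    reducing seek overhead."""
    heap = [(run[0][0], r, 0) for r, run in enumerate(runs)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, r, b = heapq.heappop(heap)
        order.append((r, b))
        if b + 1 < len(runs[r]):
            heapq.heappush(heap, (runs[r][b + 1][0], r, b + 1))
    return order
```

The extra buffer space mentioned above is what lets the reader run ahead of the merge and reorder reads by disk address instead of by need time.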
Software Caching on Cache-Coherent Multiprocessors
In Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, 1992
Abstract

Cited by 15 (1 self)
Programmers have always been concerned with data distribution and remote memory access costs on shared-memory multiprocessors that lack coherent caches, like the BBN Butterfly. Recently, memory latency has become an important issue on cache-coherent multiprocessors, where dramatic improvements in microprocessor performance have increased the relative cost of cache misses and coherency transactions. The trend towards a deep memory hierarchy in cache-coherent multiprocessors suggests that techniques used to improve the locality of reference on a machine lacking coherent caches might be useful on a cache-coherent machine. In this paper we explore the utility of software caching on a machine with coherent caches. In particular, we show that by caching at the application level we can avoid the problem of false sharing on cache-coherent machines. We compare the performance of software caching with other techniques for alleviating false sharing, and show that software caching performs better th...
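A minimal illustration of caching at the application level: instead of updating a shared array element by element (the access pattern that triggers false sharing when neighbouring elements fall on the same cache line), each worker copies its slice into private memory, updates the private copy, and writes it back once. Python with a multiprocessing Array stands in for shared memory here; the pattern, not the machinery, is the point:

```python
from multiprocessing import Array

def worker(shared, lo, hi, iters):
    """Update shared[lo:hi] through a private 'software cache'."""
    local = shared[lo:hi]            # one bulk read into private memory
    for _ in range(iters):
        for i in range(len(local)):
            local[i] += 1            # all updates hit the private copy
    shared[lo:hi] = local            # one bulk write-back
```

On a real cache-coherent machine each worker would run in its own thread or process; the copy-in/update/copy-out structure is what keeps concurrent writers off each other's cache lines.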
Overlapping Computations, Communications and I/O in Parallel Sorting
Journal of Parallel and Distributed Computing, 1994
Abstract

Cited by 7 (1 self)
In this paper we present a new parallel sorting algorithm which maximizes the overlap between the disk, network, and CPU subsystems of a processing node. This algorithm is shown to be of similar complexity to known efficient sorting algorithms. The pipelining effect exploited by our algorithm should lead to higher levels of performance on distributed memory parallel processors. In order to achieve the best results using this strategy, the CPU, network and disk operations must take comparable time. We suggest acceptable levels of system balance for sorting machines and analyze the performance of the sorting algorithm as system parameters vary.
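The overlap described above can be sketched with one thread per subsystem connected by bounded queues; here list chunks stand in for disk blocks and `emit` stands in for a network send (all names are illustrative, not the paper's):

```python
import queue
import threading

def pipeline(chunks, emit):
    """Overlap reading, sorting, and sending by running each stage in
    its own thread; small bounded queues decouple the stages so the
    disk, CPU, and network can all be busy at once."""
    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)

    def read():                        # disk subsystem
        for chunk in chunks:
            q1.put(chunk)
        q1.put(None)                   # end-of-stream marker

    def sort():                        # CPU subsystem
        while (chunk := q1.get()) is not None:
            q2.put(sorted(chunk))
        q2.put(None)

    def send():                        # network subsystem
        while (chunk := q2.get()) is not None:
            emit(chunk)

    threads = [threading.Thread(target=f) for f in (read, sort, send)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The pipeline only pays off when the three stages take comparable time per chunk, which is exactly the system-balance condition the abstract raises.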
Parallel Sorting of Large Data Volumes on Distributed Memory Multiprocessors
In Arndt Bode and Mario Dal Cin, editors, Parallel Computer Architectures: Theory, Hardware, Software, Applications, 1993
Abstract

Cited by 1 (0 self)
The use of multiprocessor architectures requires the parallelization of sorting algorithms. A parallel sorting algorithm based on horizontal parallelization is presented. This algorithm is suited for large data volumes (external sorting) and does not suffer from processing skew in the presence of data skew. The core of the parallel sorting algorithm is a new adaptive partitioning method. The effect of data skew is remedied by taking samples representing the distribution of the input data. The parallel algorithm has been implemented on top of a shared disk multiprocessor architecture. The performance evaluation of the algorithm shows that it has linear speedup. Furthermore, the optimal degree of CPU parallelism is derived if I/O limitations are taken into account. 1 Introduction Data sorting plays an important role in computer science and has been studied extensively [23]. The problem of sorting is easily understood. In the sequential case, sorting of tuples (i.e. data items) has at lea...