Results 1  10
of
119
CommunicationEfficient Parallel Sorting
, 1996
"... We study the problem of sorting n numbers on a pprocessor bulksynchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processortoprocessor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sort ..."
Abstract

Cited by 75 (4 self)
 Add to MetaCart
(Show Context)
We study the problem of sorting n numbers on a pprocessor bulksynchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processortoprocessor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O( n log n p ) and a number of communication rounds that is O( log n log(h+1) ) for h = \Theta(n=p). The internal computation bound is optimal for any comparisonbased sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p n 1\Gamma1=c for a constant c 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n=h) of an arbitrary number of processors in a BSP computer requires\Omega\Gammaqui n= log(h...
On the Versatility of Parallel Sorting by Regular Sampling
 Parallel Computing
, 1993
"... Parallel sorting algorithms have already been proposed for a variety of multiple instruction streams, multiple data streams (MIMD) architectures. These algorithms often exploit the strengths of the particular machine to achieve high performance. In many cases, however, the existing algorithms cannot ..."
Abstract

Cited by 60 (14 self)
 Add to MetaCart
(Show Context)
Parallel sorting algorithms have already been proposed for a variety of multiple instruction streams, multiple data streams (MIMD) architectures. These algorithms often exploit the strengths of the particular machine to achieve high performance. In many cases, however, the existing algorithms cannot achieve comparable performance on other architectures. Parallel Sorting by Regular Sampling (PSRS) is an algorithm that is suitable for a diverse range of MIMD architectures. It has good load balancing properties, modest communication needs and good memory locality of reference. If there are no duplicate keys, PSRS guarantees to balance the work among the processors within a factor of two of optimal in theory, regardless of the data value distribution, and within a few percent of optimal in practice. This paper presents new theoretical and empirical results for PSRS. The theoretical analysis of PSRS is extended to include a lower bound and a tighter upper bound on the work done by a process...
From Patterns to Frameworks to Parallel Programs
, 2002
"... Parallel programming offers potentially large performance benefits for computationally intensive problems. Unfortunately, it is difficult to obtain these benefits because parallel programs are more complex than their sequential counterparts. One way to reduce this complexity is to use a parallel p ..."
Abstract

Cited by 48 (9 self)
 Add to MetaCart
Parallel programming offers potentially large performance benefits for computationally intensive problems. Unfortunately, it is difficult to obtain these benefits because parallel programs are more complex than their sequential counterparts. One way to reduce this complexity is to use a parallel programming system to write parallel programs. This dissertation shows a new approach to writing objectoriented parallel programs based on design patterns, frameworks, and multiple layers of abstraction. This approach is intended as the basis for a new generation of parallel programming systems. A critical
Deterministic Sorting and Randomized Median Finding on the BSP model
, 1996
"... We present new BSP algorithms for deterministic sorting and randomized median finding. We sort n general keys by using a partitioning scheme that achieves the requirements of efficiency (oneoptimality) and insensitivity against data skew (the accuracy of the splitting keys depends solely on the ste ..."
Abstract

Cited by 48 (23 self)
 Add to MetaCart
We present new BSP algorithms for deterministic sorting and randomized median finding. We sort n general keys by using a partitioning scheme that achieves the requirements of efficiency (oneoptimality) and insensitivity against data skew (the accuracy of the splitting keys depends solely on the step distance, which can be adapted to meet the worstcase requirements of our application). Although we employ sampling in order to realize efficiency, we can give a precise worstcase estimation of the maximum imbalance which might occur. We also investigate optimal randomized BSP algorithms for the problem of finding the median of n elements that require, with highprobability, 3n=(2p) + o(n=p) number of comparisons, for a wide range of values of n and p. Experimental results for the two algorithms are also presented.
Using Generative Design Patterns to Generate Parallel Code for a Distributed Memory Environment
, 2003
"... A design pattern is a mechanism for encapsulating the knowledge of experienced designers into a reusable artifact. Parallel design patterns reflect commonly occurring parallel communication and synchronization structures. Our tools, CO 2 P 3 S (Correct ObjectOriented Patternbased Parallel Program ..."
Abstract

Cited by 32 (11 self)
 Add to MetaCart
(Show Context)
A design pattern is a mechanism for encapsulating the knowledge of experienced designers into a reusable artifact. Parallel design patterns reflect commonly occurring parallel communication and synchronization structures. Our tools, CO 2 P 3 S (Correct ObjectOriented Patternbased Parallel Programming System) and MetaCO 2 P 3 S, use generative design patterns. A programmer selects the parallel design patterns that are appropriate for an application, and then adapts the patterns for that specific application by selecting from a small set of codeconfiguration options. CO 2 P 3 S then generates a custom framework for the application that includes all of the structural code necessary for the application to run in parallel. The programmer is only required to write simple code that launches the application and to fill in some applicationspecific sequential hook routines. We use generative design patterns to take an application specification (parallel design patterns + sequential user code) and use it to generate parallel application code that achieves good performance in shared memory and distributed memory environments. Although our implementations are for Java, the approach we describe is tool and language independent. This paper describes generalizing CO 2 P 3 S to generate distributedmemory parallel solutions.
Designing Practical Efficient Algorithms for Symmetric Multiprocessors (Extended Abstract)
 IN ALGORITHM ENGINEERING AND EXPERIMENTATION (ALENEX’99
, 1999
"... Symmetric multiprocessors (SMPs) dominate the highend server market and are currently the primary candidate for constructing large scale multiprocessor systems. Yet, the design of efficient parallel algorithms for this platform currently poses several challenges. In this paper, we present a comput ..."
Abstract

Cited by 25 (0 self)
 Add to MetaCart
Symmetric multiprocessors (SMPs) dominate the highend server market and are currently the primary candidate for constructing large scale multiprocessor systems. Yet, the design of efficient parallel algorithms for this platform currently poses several challenges. In this paper, we present a computational model for designing efficient algorithms for symmetric multiprocessors. We then use this model to create efficient solutions to two widely different types of problems  linked list prefix computations and generalized sorting. Our novel algorithm for prefix computations builds upon the sparse ruling set approach of ReidMiller and Blelloch. Besides being somewhat simpler and requiring nearly half the number of memory accesses, we can bound our complexity with high probabi...
CGMgraph/CGMlib: Implementing and Testing CGM Graph Algorithms on PC Clusters
 International Journal of High Performance Computing Applications
, 2003
"... In this paper, we present CGMgraph, the first integrated library of parallel graph methods for PCclu8(T9 based on CGM algo rithms. CGMgraph implements parallel methods for variou graph prob lems. Ou implementations of deterministic list ranking, Eu er tou con nected components, spanning forest, and ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
(Show Context)
In this paper, we present CGMgraph, the first integrated library of parallel graph methods for PCclu8(T9 based on CGM algo rithms. CGMgraph implements parallel methods for variou graph prob lems. Ou implementations of deterministic list ranking, Eu er tou con nected components, spanning forest, and bipartite graph detection are, to ou r knowledge, the first e#cient implementations for PC clu sters.Ou library also inclu des CGMlib, a library of basic CGM tools su ch as sort ing, prefix su m, one to all broadcast, all to one gather, h Relation, all to all broadcast, array balancing, and CGM partitioning. Both libraries are available for download at http://cgm.dehne.net. 1
Optimizing MapReduce for Multicore Architectures
 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Tech. Rep
, 2010
"... MapReduce is a programming model for dataparallel programs originally intended for data centers. MapReduce simplifies parallel programming, hiding synchronization and task management. These properties make it a promising programming model for future processors with many cores, and existing MapReduc ..."
Abstract

Cited by 24 (2 self)
 Add to MetaCart
(Show Context)
MapReduce is a programming model for dataparallel programs originally intended for data centers. MapReduce simplifies parallel programming, hiding synchronization and task management. These properties make it a promising programming model for future processors with many cores, and existing MapReduce libraries such as Phoenix have demonstrated that applications written with MapReduce perform competitively with those written with Pthreads [11]. This paper explores the design of the MapReduce data structures for grouping intermediate key/value pairs, which is often a performance bottleneck on multicore processors. The paper finds the best choice depends on workload characteristics, such as the number of keys used by the application, the degree of repetition of keys, etc. This paper also introduces a new MapReduce library, Metis, with a compromise data structure designed to perform well for most workloads. Experiments with the Phoenix benchmarks on a 16core AMDbased server show that Metis ’ data structure performs better than simpler alternatives, including Phoenix. 1
A New Deterministic Parallel Sorting Algorithm With an Experimental Evaluation
 ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS
, 1996
"... We introduce a new deterministic parallel sorting algorithm based on the regular sampling approach. The algorithm uses only two rounds of regular alltoall personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, o ..."
Abstract

Cited by 23 (6 self)
 Add to MetaCart
We introduce a new deterministic parallel sorting algorithm based on the regular sampling approach. The algorithm uses only two rounds of regular alltoall personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in SplitC and run on a variety of platforms, including the Thinking Machines CM5, the IBM SP2WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platfo...