Results 1–10 of 80
Communication-Efficient Parallel Sorting
, 1996
Abstract

Cited by 64 (2 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1−1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
On the Versatility of Parallel Sorting by Regular Sampling
 Parallel Computing
, 1993
Abstract

Cited by 48 (14 self)
Parallel sorting algorithms have already been proposed for a variety of multiple instruction streams, multiple data streams (MIMD) architectures. These algorithms often exploit the strengths of the particular machine to achieve high performance. In many cases, however, the existing algorithms cannot achieve comparable performance on other architectures. Parallel Sorting by Regular Sampling (PSRS) is an algorithm that is suitable for a diverse range of MIMD architectures. It has good load balancing properties, modest communication needs and good memory locality of reference. If there are no duplicate keys, PSRS guarantees to balance the work among the processors within a factor of two of optimal in theory, regardless of the data value distribution, and within a few percent of optimal in practice. This paper presents new theoretical and empirical results for PSRS. The theoretical analysis of PSRS is extended to include a lower bound and a tighter upper bound on the work done by a process...
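The regular-sampling scheme summarized above can be illustrated with a small sequential simulation (a sketch assuming n is comfortably larger than p², with the p processors represented as plain Python lists; this is not the paper's implementation):

```python
import bisect
from heapq import merge

def psrs(data, p):
    """Sequentially simulate Parallel Sorting by Regular Sampling."""
    n = len(data)
    # Phase 1: each simulated processor sorts its local block.
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Phase 2: every block contributes p regularly spaced samples.
    samples = sorted(b[j * len(b) // p] for b in blocks for j in range(p))
    # Phase 3: p - 1 pivots are taken at regular positions in the samples.
    pivots = samples[p::p]
    # Phase 4: processor j collects its key range from every block
    # (the all-to-all exchange) and merges the sorted pieces.
    result = []
    for j in range(p):
        parts = []
        for b in blocks:
            lo = bisect.bisect_right(b, pivots[j - 1]) if j > 0 else 0
            hi = bisect.bisect_right(b, pivots[j]) if j < p - 1 else len(b)
            parts.append(b[lo:hi])
        result.extend(merge(*parts))
    return result
```

With many duplicate keys the partitions can become skewed, which matches the abstract's no-duplicate-keys caveat on the factor-of-two balance guarantee.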
Deterministic Sorting and Randomized Median Finding on the BSP model
, 1996
Abstract

Cited by 48 (23 self)
We present new BSP algorithms for deterministic sorting and randomized median finding. We sort n general keys by using a partitioning scheme that achieves the requirements of efficiency (one-optimality) and insensitivity to data skew (the accuracy of the splitting keys depends solely on the step distance, which can be adapted to meet the worst-case requirements of our application). Although we employ sampling in order to realize efficiency, we can give a precise worst-case estimate of the maximum imbalance that might occur. We also investigate optimal randomized BSP algorithms for the problem of finding the median of n elements that require, with high probability, 3n/(2p) + o(n/p) comparisons, for a wide range of values of n and p. Experimental results for the two algorithms are also presented.
Using Generative Design Patterns to Generate Parallel Code for a Distributed Memory Environment
, 2003
Abstract

Cited by 29 (9 self)
A design pattern is a mechanism for encapsulating the knowledge of experienced designers into a reusable artifact. Parallel design patterns reflect commonly occurring parallel communication and synchronization structures. Our tools, CO₂P₃S (Correct Object-Oriented Pattern-based Parallel Programming System) and MetaCO₂P₃S, use generative design patterns. A programmer selects the parallel design patterns that are appropriate for an application, and then adapts the patterns for that specific application by selecting from a small set of code-configuration options. CO₂P₃S then generates a custom framework for the application that includes all of the structural code necessary for the application to run in parallel. The programmer is only required to write simple code that launches the application and to fill in some application-specific sequential hook routines. We use generative design patterns to take an application specification (parallel design patterns + sequential user code) and use it to generate parallel application code that achieves good performance in shared-memory and distributed-memory environments. Although our implementations are for Java, the approach we describe is tool- and language-independent. This paper describes generalizing CO₂P₃S to generate distributed-memory parallel solutions.
The Block Distributed Memory Model
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1994
Abstract

Cited by 28 (5 self)
We introduce a computation model for developing and analyzing parallel algorithms on distributed memory machines. The model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. We capture performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters reflecting memory latency, communication bandwidth, and spatial locality. Our model allows the initial placement of the input data and pipelined prefetching. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, FFT, and matrix multiplication. We show that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing an optimal speedup in computational complexity.
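As a toy illustration of such a communication cost measure, the sketch below charges a fixed startup latency `tau` plus a per-word cost `sigma` for each remote block transfer; the parameter names and the exact formula are assumptions for illustration, not the paper's definitions:

```python
def block_transfer_cost(m, tau, sigma):
    """Assumed cost of fetching m consecutive remote words: one startup
    latency tau plus sigma per word transferred."""
    return tau + m * sigma

def access_cost(block_sizes, tau, sigma):
    """Total cost of a sequence of remote block accesses. Fetching m words
    as one block pays tau once, so spatially local access is rewarded."""
    return sum(block_transfer_cost(m, tau, sigma) for m in block_sizes)
```

For example, one 8-word block costs tau + 8·sigma, while eight 1-word accesses cost 8·(tau + sigma), which is why algorithms on such models are designed around blocked data rearrangement.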
A New Deterministic Parallel Sorting Algorithm With an Experimental Evaluation
 ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS
, 1996
Abstract

Cited by 22 (6 self)
We introduce a new deterministic parallel sorting algorithm based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platfo...
Designing Practical Efficient Algorithms for Symmetric Multiprocessors (Extended Abstract)
 IN ALGORITHM ENGINEERING AND EXPERIMENTATION (ALENEX'99)
, 1999
Abstract

Cited by 20 (0 self)
Symmetric multiprocessors (SMPs) dominate the high-end server market and are currently the primary candidate for constructing large-scale multiprocessor systems. Yet, the design of efficient parallel algorithms for this platform currently poses several challenges. In this paper, we present a computational model for designing efficient algorithms for symmetric multiprocessors. We then use this model to create efficient solutions to two widely different types of problems: linked-list prefix computations and generalized sorting. Our novel algorithm for prefix computations builds upon the sparse ruling set approach of Reid-Miller and Blelloch. Besides being somewhat simpler and requiring nearly half the number of memory accesses, we can bound our complexity with high probabi...
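For the linked-list prefix problem, the classic baseline that sparse-ruling-set methods improve on is pointer jumping (Wyllie's algorithm). The sketch below simulates it sequentially on an array-encoded list, computing each node's distance from the tail; it illustrates the baseline technique only, not the authors' algorithm:

```python
def wyllie_rank(next_):
    """Pointer-jumping list ranking: distance of each node from the tail.
    next_[i] is the successor of node i; the tail points to itself."""
    n = len(next_)
    # A node's initial rank is 1 unless it is the tail.
    rank = [0 if next_[i] == i else 1 for i in range(n)]
    succ = list(next_)
    while True:
        new_rank, new_succ = rank[:], succ[:]
        for i in range(n):          # one simulated parallel superstep
            if succ[i] != i:
                new_rank[i] = rank[i] + rank[succ[i]]
                new_succ[i] = succ[succ[i]]
        if new_succ == succ:        # all pointers have reached the tail
            return new_rank
        rank, succ = new_rank, new_succ
```

Each superstep doubles the distance every pointer has jumped, so O(log n) rounds suffice; the cost is that every node is touched in every round, which is exactly the memory traffic the sparse-ruling-set approach reduces.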
Parallel Program Archetypes
 In Proceedings of the Scalable Parallel Library Conference
, 1997
Abstract

Cited by 18 (0 self)
A parallel program archetype is an abstraction that captures the common features of a class of problems with similar computational structure and combines them with a parallelization strategy to produce a pattern of dataflow and communication. Such abstractions are useful in application development, both as a conceptual framework and as a basis for tools and techniques. This paper describes an approach to parallel application development based on archetypes and presents two example archetypes with applications.

1 Introduction

This paper proposes a specific method of exploiting computational and dataflow patterns to help in developing reliable parallel programs. A great deal of work has been done on methods of exploiting design patterns in program development. This paper restricts attention to one kind of pattern that is relevant in parallel programming: the pattern of the parallel computation and communication structure. Methods of exploiting design patterns in program develop...
Fast and Parallel Mapping Algorithms for Irregular Problems
 Journal of Supercomputing
, 1993
Abstract

Cited by 16 (0 self)
In this paper we develop simple index-based graph partitioning techniques. We show that our methods are very fast and easily parallelizable, and that they produce good-quality mappings. These properties make them useful for parallelizing a number of irregular and adaptive applications.

Index Terms: Mapping, Remapping, Parallel, Merging, Sorting

1 Introduction

Parallelizing data-parallel programs on distributed-memory parallel computers requires careful attention to load balancing and to reducing communication in order to achieve good performance. For most regular and synchronous problems [13], mapping can be performed at compile time by giving directives to decompose the data and its corresponding computations [8]. For irregular applications, achieving a good mapping is considerably more difficult; the nature of the irregularities may not be known at compile time and can be derived only at runtime [7]. These applications can be represented as computational graphs...
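One common concrete instance of index-based partitioning is to order vertices by a space-filling-curve index and cut the sorted order into equal-size pieces. The sketch below uses a Morton (Z-order) index over 2-D coordinates as an assumed example of such an index; it is not necessarily the authors' exact scheme:

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of two non-negative integer coordinates
    to form a Z-order (Morton) index."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)        # x bits go to even positions
        z |= ((y >> b) & 1) << (2 * b + 1)    # y bits go to odd positions
    return z

def partition_by_index(points, p):
    """Sort points by Morton index, then cut into p balanced chunks.
    Nearby points tend to share index prefixes, so chunks stay spatially
    compact, which limits cross-partition communication."""
    order = sorted(points, key=lambda q: morton_index(*q))
    n = len(order)
    return [order[i * n // p:(i + 1) * n // p] for i in range(p)]
```

Because the partition is just a sort followed by an even cut, it is fast, trivially parallelizable (parallel sort plus index arithmetic), and easy to recompute when an adaptive application changes its computational graph, matching the properties the abstract claims for index-based methods.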