Results 1  10
of
13
Scans as Primitive Parallel Operations
 IEEE Transactions on Computers
, 1987
"... In most parallel randomaccess machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can executed in no more time than these parallel memory references. This paper outline an extensive study of ..."
Abstract

Cited by 157 (12 self)
 Add to MetaCart
In most parallel randomaccess machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can executed in no more time than these parallel memory references. This paper outline an extensive study of the effect of including in the PRAM models, such scan operations as unittime primitives. The study concludes that the primitives improve the asymptotic running time of many algorithms by an O(lg n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. We therefore argue that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. This paper describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radixsort algorithm, a quicksort algorithm, a minimumspanning tree algorithm, a linedrawing algorithm and a mergi...
Models and Languages for Parallel Computation
 ACM COMPUTING SURVEYS
, 1998
"... We survey parallel programming models and languages using 6 criteria [:] should be easy to program, have a software development methodology, be architectureindependent, be easy to understand, guranatee performance, and provide info about the cost of programs. ... We consider programming models in ..."
Abstract

Cited by 135 (4 self)
 Add to MetaCart
We survey parallel programming models and languages using 6 criteria [:] should be easy to program, have a software development methodology, be architectureindependent, be easy to understand, guranatee performance, and provide info about the cost of programs. ... We consider programming models in 6 categories, depending on the level of abstraction they provide.
Randomized routing and sorting on fixedconnection networks
 Journal of Algorithms
, 1994
"... This paper presents a general paradigm for the design of packet routing algorithms for fixedconnection networks. Its basis is a randomized online algorithm for scheduling any set of N packets whose paths have congestion c on any boundeddegree leveled network with depth L in O(c + L + log N) steps ..."
Abstract

Cited by 88 (13 self)
 Add to MetaCart
This paper presents a general paradigm for the design of packet routing algorithms for fixedconnection networks. Its basis is a randomized online algorithm for scheduling any set of N packets whose paths have congestion c on any boundeddegree leveled network with depth L in O(c + L + log N) steps, using constantsize queues. In this paradigm, the design of a routing algorithm is broken into three parts: (1) showing that the underlying network can emulate a leveled network, (2) designing a path selection strategy for the leveled network, and (3) applying the scheduling algorithm. This strategy yields randomized algorithms for routing and sorting in time proportional to the diameter for meshes, butterflies, shuffleexchange graphs, multidimensional arrays, and hypercubes. It also leads to the construction of an areauniversal network: an Nnode network with area Θ(N) that can simulate any other network of area O(N) with slowdown O(log N).
A provable time and space efficient implementation of nesl
 In International Conference on Functional Programming
, 1996
"... In this paper we prove time and space bounds for the implementation of the programming language NESL on various parallel machine models. NESL is a sugared typed Jcalculus with a set of array primitives and an explicit parallel map over arrays. Our results extend previous work on provable implementa ..."
Abstract

Cited by 71 (7 self)
 Add to MetaCart
In this paper we prove time and space bounds for the implementation of the programming language NESL on various parallel machine models. NESL is a sugared typed Jcalculus with a set of array primitives and an explicit parallel map over arrays. Our results extend previous work on provable implementation bounds for functional languages by considering space and by including arrays. For modeling the cost of NESL we augment a standard callbyvalue operational semantics to return two cost measures: a DAG representing the sequential dependence in the computation, and a measure of the space taken by a sequential implementation. We show that a NESL program with w work (nodes in the DAG), d depth (levels in the DAG), and s sequential space can be implemented on a p processor butterfly network, hypercube, or CRCW PRAM usin O(w/p + d log p) time and 0(s + dp logp) reachable space. For programs with sufficient parallelism these bounds are optimal in that they give linew speedup and use space within a constant factor of the sequential space. 1
The BirdMeertens Formalism as a Parallel Model
 Software for Parallel Computation, volume 106 of NATO ASI Series F
, 1993
"... The expense of developing and maintaining software is the major obstacle to the routine use of parallel computation. Architecture independent programming offers a way of avoiding the problem, but the requirements for a model of parallel computation that will permit it are demanding. The BirdMeertens ..."
Abstract

Cited by 41 (0 self)
 Add to MetaCart
The expense of developing and maintaining software is the major obstacle to the routine use of parallel computation. Architecture independent programming offers a way of avoiding the problem, but the requirements for a model of parallel computation that will permit it are demanding. The BirdMeertens formalism is an approach to developing and executing dataparallel programs; it encourages software development by equational transformation; it can be implemented efficiently across a wide range of architecture families; and it can be equipped with a realistic cost calculus, so that tradeoffs in software design can be explored before implementation. It makes an ideal model of parallel computation. Keywords: General purpose parallel computing, models of parallel computation, architecture independent programming, categorical data type, program transformation, code generation. 1 Properties of Models of Parallel Computation Parallel computation is still the domain of researchers and those ...
The QueueRead QueueWrite PRAM Model: Accounting for Contention in Parallel Algorithms
 Proc. 5th ACMSIAM Symp. on Discrete Algorithms
, 1997
"... Abstract. This paper introduces the queueread queuewrite (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to sharedmemory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to thi ..."
Abstract

Cited by 24 (10 self)
 Add to MetaCart
Abstract. This paper introduces the queueread queuewrite (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to sharedmemory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the wellstudied crcw pram or erew pram models: the crcw model does not adequately penalize algorithms with high contention to sharedmemory locations, while the erew model is too strict in its insistence on zero contention at each step. The�qrqw pram is strictly more powerful than the erew pram. This paper shows a separation of log n between the two models, and presents faster and more efficient qrqw algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a workpreserving emulation of the qrqw pram with only logarithmic slowdown on Valiant’s bsp model, and hence on hypercubetype noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the bestknown emulation result for the erew pram, and considerably improves upon the bestknown efficient emulation for the crcw pram on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
Performance Modeling of Distributed Memory Architectures
 Journal of Parallel and Distributed Computing
, 1991
"... We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single source, and multiple source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts ..."
Abstract

Cited by 20 (7 self)
 Add to MetaCart
We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single source, and multiple source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multidimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices, and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory, and a set of registers. The analytic models are verified by measurements on the Connection Machine model CM2. 1 Introduction This paper addresses crucial issues in performance modeling of distributed memory architectures designed for s...
A Provably TimeEfficient Parallel Implementation of Full Speculation
 In Proceedings of the 23rd ACM Symposium on Principles of Programming Languages
, 1996
"... Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation of a speculative functional language on various machine models. The implementation includes proper parallelization of the necessary queuing operations on suspended threads. Our target machine models are a butterfly network, hypercube, and PRAM. To prove the efficiency of our implementation, we provide a cost model using a profiling semantics and relate the cost model to implementations on the parallel machine models. 1 Introduction Futures, lenient languages, and several implementations of graph reduction for lazy languages all use speculative evaluation (callbyspeculation [15]) to expose parallelism. The basic idea of speculative evaluation, in this context, is that the evaluation of a...
Scheduling Parallel Communication: The hRelation Problem
 IN PROC. OF THE 20TH INTERNATIONAL SYMP. ON MATHEMATICAL FOUNDATIONS OF COMPUTER SCIENCE, LNCS 969
, 1995
"... This paper is concerned with the efficient scheduling and routing of pointtopoint messages in a distributed computing system with n processors. We examine the hrelation problem, a routing problem where each processor has at most h messages to send and at most h messages to receive. Communica ..."
Abstract

Cited by 15 (0 self)
 Add to MetaCart
This paper is concerned with the efficient scheduling and routing of pointtopoint messages in a distributed computing system with n processors. We examine the hrelation problem, a routing problem where each processor has at most h messages to send and at most h messages to receive. Communication is carried out in rounds. Direct communication is possible from any processor to any other, and in each round a processor can send one message and receive one message. The offline version of the problem arises when every processor knows the source and destination of every message. In this case the messages can be routed in at most h rounds. More interesting, and more typical, is the online version, in which each processor has knowledge only of h and of the destinations of those messages which it must send. The online version of the problem is the focus of this paper. The difficulty of the hrelation problem stems from message conflicts, in which two or more messages are se...
Asynchronous Shared Memory Search Structures
 In Proc. 8th ACM Symp. on Parallel Algorithms and Architectures
, 1996
"... We study the problem of storing an ordered set on an asynchronous shared memory parallel computer. We examine the case where we want to efficiently perform successor (least upper bound) queries on the set members that are stored. We also examine the case where processors insert and delete members of ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
We study the problem of storing an ordered set on an asynchronous shared memory parallel computer. We examine the case where we want to efficiently perform successor (least upper bound) queries on the set members that are stored. We also examine the case where processors insert and delete members of the set. Due to asynchrony, we require processors to perform queries and to maintain the structure independently. Although several such structures have been proposed, the analysis of these structures has been very limited. We here use the recently proposed QRQW PRAM model to provide upper and lower bounds on the performance of such data structures. In the asynchronous QRQW PRAM, the problem of processors concurrently and independently searching a shared data structure is very similar to the problem of routing packets through a network. Using this as a guide, we introduce the SearchButterfly, a search structure that combines the efficient packet routing properties of the butterfly graph wit...