Results 1 - 10
of
25
Contention in Shared Memory Algorithms
, 1993
"... Most complexitymeasures for concurrent algorithms for asynchronous sharedmemory architectures focus on process steps and memory consumption. In practice, however, performance of multiprocessor algorithms is heavily influenced by contention, the extent to which processes access the same location at t ..."
Abstract
-
Cited by 57 (1 self)
- Add to MetaCart
Most complexitymeasures for concurrent algorithms for asynchronous sharedmemory architectures focus on process steps and memory consumption. In practice, however, performance of multiprocessor algorithms is heavily influenced by contention, the extent to which processes access the same location at the same time. Nevertheless, even though contention is one of the principal considerations affecting the performance of real algorithms on real multiprocessors, there are no formal tools for analyzing the contention of asynchronous shared-memory algorithms. This paper introduces the first formal complexity model for contention in multiprocessors. We focus on the standard multiprocessor architecture in which n asynchronous processes communicate by applying read, write, and read-modify-write operations to a shared memory. We use our model to derive two kinds of results: (1) lower bounds on contention for well known basic problems such as agreement and mutual exclusion, and (2) trade-offs betwe...
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
, 1999
"... There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style fo ..."
Abstract
-
Cited by 41 (11 self)
- Add to MetaCart
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism as a convenient style, and the shared-memory abstraction as an easyto-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP, and on a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios
Accounting for memory bank contention and delay in high-bandwidth multiprocessors
- In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1997
"... Abstract—For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object ..."
Abstract
-
Cited by 29 (5 self)
- Add to MetaCart
Abstract—For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant’s bulk-synchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
Parallel Sorting With Limited Bandwidth
- in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1995
"... We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the trade-off between the amount of local computation an ..."
Abstract
-
Cited by 26 (5 self)
- Add to MetaCart
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the trade-off between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. We prove a lower bound of \Omega\Gamma n log m m ) on the time to sort n numbers in an exclusive-read variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that bot...
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
- Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
"... Abstract. This paper introduces the queue-read queue-write (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to thi ..."
Abstract
-
Cited by 21 (10 self)
- Add to MetaCart
Abstract. This paper introduces the queue-read queue-write (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models: the crcw model does not adequately penalize algorithms with high contention to shared-memory locations, while the erew model is too strict in its insistence on zero contention at each step. The�qrqw pram is strictly more powerful than the erew pram. This paper shows a separation of log n between the two models, and presents faster and more efficient qrqw algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the qrqw pram with only logarithmic slowdown on Valiant’s bsp model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the bestknown emulation result for the erew pram, and considerably improves upon the best-known efficient emulation for the crcw pram on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
The Queue-Read Queue-Write Asynchronous PRAM Model
- Euro-Par'96 Parallel Processing, Lecture Notes in Computer Science
, 1998
"... This paper presents results for the queue-read, queue-write asynchronous parallel random access machine (qrqw asynchronous pram) model, which is the asynchronous variant of the qrqw pram model. The qrqw pram family of models, which was introduced earlier by the authors, permit concurrent reading ..."
Abstract
-
Cited by 21 (8 self)
- Add to MetaCart
This paper presents results for the queue-read, queue-write asynchronous parallel random access machine (qrqw asynchronous pram) model, which is the asynchronous variant of the qrqw pram model. The qrqw pram family of models, which was introduced earlier by the authors, permit concurrent reading and writing to shared memory locations, but each memory location is viewed as having a queue which can service at most one request at a time. In the basic qrqw pram model each processor executes a series of reads to shared memory locations, a series of local computation steps, and a series of writes to shared memory locations, and then synchronizes with all other processors; thus this can be viewed as a bulk-synchronous model. In contrast, in the qrqw asynchronous pram model discussed in this paper, there is no imposed bulksynchronization between processors, and each processor proceeds at its own pace. Thus, the qrqw asynchronous pram serves as a better model for designing and analyz...
Delayed path coupling and generating random permutations via distributed stochastic processes
, 1999
"... We analyze various stochastic processes for generating permutations almost uniformly at random in distributed and parallel systems. All our protocols are simple, elegant and are based on performing disjoint transpositions executed in parallel. The challenging problem of our concern is to prove that ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
We analyze various stochastic processes for generating permutations almost uniformly at random in distributed and parallel systems. All our protocols are simple, elegant and are based on performing disjoint transpositions executed in parallel. The challenging problem of our concern is to prove that the output configurations in our processes reach almost uniform probability distribution very rapidly, i.e. in a (low) polylogarithmic time. For the analysis of the aforementioned protocols we develop a novel technique, called delayed path coupling, for proving rapid mixing of Markov chains. Our approach is an extension of the path coupling method of Bubley and Dyer. We apply delayed path coupling to three stochastic processes for generating random permutations. For one
Scheduling Parallel Communication: The h-Relation Problem
- IN PROC. OF THE 20TH INTERNATIONAL SYMP. ON MATHEMATICAL FOUNDATIONS OF COMPUTER SCIENCE, LNCS 969
, 1995
"... This paper is concerned with the efficient scheduling and routing of point-to-point messages in a distributed computing system with n processors. We examine the h-relation problem, a routing problem where each processor has at most h messages to send and at most h messages to receive. Communica ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
This paper is concerned with the efficient scheduling and routing of point-to-point messages in a distributed computing system with n processors. We examine the h-relation problem, a routing problem where each processor has at most h messages to send and at most h messages to receive. Communication is carried out in rounds. Direct communication is possible from any processor to any other, and in each round a processor can send one message and receive one message. The off-line version of the problem arises when every processor knows the source and destination of every message. In this case the messages can be routed in at most h rounds. More interesting, and more typical, is the on-line version, in which each processor has knowledge only of h and of the destinations of those messages which it must send. The on-line version of the problem is the focus of this paper. The difficulty of the h-relation problem stems from message conflicts, in which two or more messages are se...
A randomized time-work optimal parallel algorithm for finding a minimum spanning forest
- SIAM J. COMPUT
, 1999
"... We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r.t. both work and parallel time, and is the fi ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r.t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures. We also give a simple, general processor allocation scheme for tree-like computations.

