On the Physical Design of PRAMs
, 1993
Cited by 46 (13 self)
The Saarbrücken Parallel Random Access Machine (SB-PRAM) is a scalable shared memory machine. At the gate level it is a re-engineered version of the Fluent machine [A. G. Ranade, S. N. Bhatt and S. L. Johnson. The Fluent Abstract Machine. In Proc. 5th MIT Conference on Advanced Research in VLSI, pp. 71--93 (1988)]. It uses hashing of addresses, combining and latency hiding. A prototype with 128 processors is presently being designed. In this paper we deal with several problems related to the physical design of this machine such as the total number of network chips, the geometrical arrangement of boards in the network and the VLSI realization of certain sorting arrays. We also present an extremely fast method to rehash addresses without use of external memory. Research was partially supported by DFG (SFB 124) and SIEMENS AG. A preliminary version of this paper appeared in [1]. 1 Introduction Parallel machines are nowadays classified as multi-computers and multi-processors. In multi-...
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
- Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
Cited by 21 (10 self)
This paper introduces the queue-read queue-write (QRQW) parallel random access machine (PRAM) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models: the CRCW model does not adequately penalize algorithms with high contention to shared-memory locations, while the EREW model is too strict in its insistence on zero contention at each step. The QRQW PRAM is strictly more powerful than the EREW PRAM. This paper shows a separation of log n between the two models, and presents faster and more efficient QRQW algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the QRQW PRAM with only logarithmic slowdown on Valiant's BSP model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the best-known emulation result for the EREW PRAM, and considerably improves upon the best-known efficient emulation for the CRCW PRAM on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
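The contention charge described in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names and the list-of-addresses representation of a step are assumptions made for illustration, and the cost is simplified to the maximum number of processors touching any one location in a step.

```python
from collections import Counter


def max_contention(accesses):
    """Largest number of processors touching any single location in one step.

    `accesses` is a list with one shared-memory address per processor
    (a hypothetical representation chosen for this sketch).
    """
    if not accesses:
        return 0
    return max(Counter(accesses).values())


def step_cost(accesses, model):
    """Simplified time charged for one step under the named PRAM variant."""
    k = max_contention(accesses)
    if model == "crcw":
        # Concurrent access is free: every step costs one time unit.
        return 1
    if model == "qrqw":
        # Accesses to one location queue up, so cost grows with contention.
        return max(1, k)
    if model == "erew":
        # Any contention at all is disallowed.
        if k > 1:
            raise ValueError("EREW forbids concurrent access to one location")
        return 1
    raise ValueError(f"unknown model: {model}")
```

With three processors reading address 0 and one reading address 5, `step_cost([0, 0, 0, 5], "qrqw")` evaluates to 3, the same step costs 1 under `"crcw"`, and it is rejected under `"erew"`.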
Simulation-based Comparison of Hash Functions for Emulated Shared Memory
- In Proc. Parallel Architectures and Languages Europe, LNCS 694
, 1993
Cited by 17 (4 self)
The influence of several hash functions on the distribution of a shared address space onto p distributed memory modules is compared by simulations. Both synthetic workloads and address traces of applications are investigated. It turns out that on all workloads linear hash functions, although proven to be asymptotically worse, perform better than theoretically optimal polynomials of degree O(log p). The latter are also worse than hash functions that use boolean matrices. The performance measurements are done by an expected worst case analysis. Thus linear hash functions provide an efficient and easy-to-implement way to emulate shared memory. 1 Introduction Users of parallel machines more and more tend to program with the view of a global shared memory. Commercial machines (with more than 16 processors) however usually have distributed memory modules. Therefore the address space has to be mapped onto memory modules, memory access is simulated by packet routing on a network connecting ...
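As a rough illustration of mapping addresses onto p memory modules with a linear hash function, the sketch below uses the textbook form h(x) = ((a·x + b) mod P) mod p. This form, the prime P, the constants a and b, and all function names are assumptions for illustration; the abstract does not specify which linear functions the paper evaluates.

```python
from collections import Counter

PRIME = 2_147_483_647  # 2**31 - 1, an assumed prime modulus


def linear_hash(address, a, b, num_modules):
    """Map an address to a module id with h(x) = ((a*x + b) mod P) mod p."""
    return ((a * address + b) % PRIME) % num_modules


def module_loads(addresses, a, b, num_modules):
    """Count how many of the given addresses land on each module."""
    counts = Counter(linear_hash(x, a, b, num_modules) for x in addresses)
    return [counts.get(m, 0) for m in range(num_modules)]


# A strided access pattern: without hashing, `address % 16` would send
# all 64 requests to module 0.
strided = list(range(0, 64 * 16, 16))
loads = module_loads(strided, a=1_234_567_891, b=12_345, num_modules=16)
```

The multiplier is deliberately large. With a small `a`, the product `a * address` never wraps around the prime for these addresses, so every multiple of 16 would still fall on a single module.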
Realization of PRAMs: Processor Design
- Proc. WDAG'94, 8th Int. Workshop on Distributed Algorithms, Springer LNCS
, 1994
Cited by 15 (2 self)
We present a processor architecture for SB-PRAM, a parallel machine with shared address space and uniform memory access time. The processor uses a reduced instruction set and provides hardware mechanisms for the emulation of shared memory: random hashing to avoid hot spots, multiple contexts with regular scheduling to hide network latency and fast context switch to minimize overhead. Furthermore it provides hardware support for parallel operating systems and for the efficient compilation of parallel high-level languages. We give technical data for a prototype VLSI implementation with a floating point unit. 1 Introduction Parallel programming imposes more burdens on the programmer than programming in a sequential setting, but is necessary as long as automatic parallelization does not show satisfying results for all problem fields. Programming with the view of a shared memory has become popular, because it frees the programmer at least from the mapping of data and from programming ...
Simulation of PRAM Models on Meshes
- Nordic Journal on Computing, 2(1):51
, 1994
Cited by 14 (9 self)
We analyze the complexity of simulating a PRAM (parallel random access machine) on a mesh structured distributed memory machine. By utilizing suitable algorithms for randomized hashing, routing in a mesh, and sorting in a mesh, we prove that simulation of a PRAM on a $\sqrt{N} \times \sqrt{N}$ (or $\sqrt[3]{N} \times \sqrt[3]{N} \times \sqrt[3]{N}$) mesh is possible with $O(\sqrt{N})$ (respectively $O(\sqrt[3]{N})$) delay with high probability and a relatively small constant. Furthermore, with more sophisticated simulations further speed-ups are achieved; experiments show delays as low as $\sqrt{N} + o(\sqrt{N})$ (respectively $\sqrt[3]{N} + o(\sqrt[3]{N})$) per N PRAM processors. These simulations compare quite favorably with PRAM simulations on butterfly and hypercube. 1 Introduction PRAM (Parallel Random Access Machine) is an abstract model of computation. It consists of N processors, each of which may have some local memory and registers, and a global shared memory of size m. A step of a PRAM is often seen to consist of...
Can Parallel Algorithms Enhance Serial Implementation? (Extended Abstract)
, 1996
Cited by 14 (4 self)
The broad thesis presented in this paper suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. It is too early to reach definite conclusions regarding the significance of this thesis. However, using some imagination, validity of the thesis and some arguments supporting it may lead to several far-reaching outcomes: (1) Reliance on "predictability of reference" in the design of computer systems will increase. (2) Parallel algorithms will be taught as part of the standard computer science and engineering undergraduate curriculum irrespective of whether (or when) parallel processing will become ubiquitous in the general-purpose computing world. (3) A strategic agenda for high-performance parallel computing: A multi-stage agenda, which in no stage compromises user-friendliness of the programmer's...
Logic of Global Synchrony
, 2001
Cited by 12 (10 self)
An intermediate-level specification notation is presented for use with BSP-style programming. It is achieved by extending pre-post semantics to reveal state at points of global synchronisation. That enables us to integrate the pre-post, finite and reactive-process styles of specification in BSP, as shown by our treatment of the dining philosophers. The language is provided with a complete set of laws and has been formulated to benefit from a simple predicative semantics.
Program Development and Performance Prediction on BSP Machines Using Opal
, 1994
Cited by 11 (0 self)
Machine. This uses combining networks on a butterfly topology with a hashed address space to try to hide the network latency. [Abolhassan et al., 1991] analyses Ranade's approach in a quantitative way by giving cost models for implementing various parts of the PRAM machine. This is then used to demonstrate an improvement on Ranade's Fluent machine using multiple butterflies and parallel slackness. It is then shown that the proposed improved Fluent machine would have a price/performance ratio similar to that of conventional distributed memory architectures. Other attempts at realising the PRAM model involve its simulation on conventional distributed memory architectures. This method usually involves hashing the address space of the PRAM across the distributed memory of the machine and replication of variables [Mehlhorn and Vishkin, 1984], or using multiple hash functions [Abolhassan et al., 1991]. 2.2 BSP A Bulk Synchronous Parallel machine consists of a number of processor memo...
The Programming Environment of the SB-PRAM
- In Proc. 7th IASTED/ISMM Int'l Conf. on Parallel and Distributed Computing and Systems, Washington DC
, 1995
Cited by 10 (3 self)
The SB-PRAM is a shared-memory parallel computer that realizes the CRCW-PRAM model from theoretical computer science. In this paper, the SB-PRAM system is described from a programmer's point of view. Special emphasis is put on the process creation scheme and on the efficient implementation of synchronization constructs of the P4 library. Key Words: architecture, massively parallel systems, software, PRAM, shared memory. 1 Introduction The theoretical PRAM Model [6] is widely used in the theory community for specifying parallel algorithms in an elegant way [6]. A PRAM consists of an unbounded set of processors which compute synchronously in parallel. There is a single unbounded shared memory in which each processor can access any cell in unit time. This allows a synchronous execution of parallel programs on the instruction level, leading to fine-grain parallelism without time-consuming synchronization. There are different possibilities for dealing with concurrent accesses to a single me...
Reduction of network cost and wiring in Ranade's butterfly routing
, 1993
Cited by 5 (3 self)
The class of n-input butterfly networks is very important for the design of scalable parallel machines because of constant node degree, depth log n and the existence of routing algorithms capable of delivering n log n packets, forming a (log n)-relation (log n packets destined to each output), in time O(log n) with constant length buffers. Implementations however show ugly wiring. We present a method that, in comparison to the obvious implementation, reduces the number of chips and the number of links between chips in Ranade's butterfly routing by a factor of more than two. The chips remain connected as a butterfly network. This reduction in space simplifies cooling and power supply and allows for shorter links, thus reducing wire delay. 1991 Mathematics Subject Classification: 68M07, 68M10, 94C15. 1991 CR Categories: B.4.3, B.4.4, B.7.1. Keywords and Phrases: computer architecture, packet routing algorithms, butterfly networks. Note: Part of this work was done while the third author was working at the Univ. des Saarlandes, Saarbrücken, Germany. This work is partially supported by the German Science Foundation (DFG) under SFB 124 - D4 and by the Dutch Science Foundation (NWO) through NFI Project ALADDIN under Contract number NF 62-376. D. Cross is currently with Mentor Graphics Corporation, 1001 Ritter Park Drive, San Jose 95131. This article has been accepted for publication in Information Processing Letters.
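To make the butterfly structure mentioned above concrete, here is a minimal sketch of greedy bit-fixing through an n-input butterfly. It only illustrates the topology: a node at (level, row) links to (level + 1, row) and to (level + 1, row with bit `level` flipped). It is not Ranade's routing algorithm and has no combining or buffering. The function name and the choice to fix the lowest bit first are assumptions.

```python
def butterfly_path(source_row, target_row, num_inputs):
    """Return the (level, row) nodes visited from an input to an output.

    Assumes num_inputs is a power of two; the network then has
    log2(num_inputs) + 1 levels of nodes and depth log2(num_inputs).
    """
    if num_inputs < 1 or num_inputs & (num_inputs - 1):
        raise ValueError("num_inputs must be a power of two")
    if not (0 <= source_row < num_inputs and 0 <= target_row < num_inputs):
        raise ValueError("rows must lie in [0, num_inputs)")

    depth = num_inputs.bit_length() - 1  # log2(num_inputs)
    row = source_row
    path = [(0, row)]
    for level in range(depth):
        bit = 1 << level
        # Take the cross edge only if this bit still differs from the target.
        if (row ^ target_row) & bit:
            row ^= bit
        path.append((level + 1, row))
    return path
```

For an 8-input network, `butterfly_path(5, 3, 8)` visits `[(0, 5), (1, 5), (2, 7), (3, 3)]`. The path crosses exactly log n = 3 links, and each node has only two outgoing links, which matches the depth and constant-degree properties the abstract cites.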

