Results 1–10 of 16
Tradeoffs Between Communication Throughput and Parallel Time
, 1994
Abstract

Cited by 18 (1 self)
We study the effect of limited communication throughput on parallel computation in a setting where the number of processors is much smaller than the length of the input. Our model has p processors that communicate through a shared memory of size m. The input has size n, and can be read directly by all the processors. We are primarily interested in studying cases where n ≫ p ≫ m. As a test case we study the list reversal problem. For this problem we prove a time lower bound of Ω(n/√(mp)). (A similar lower bound also holds for the problems of sorting, finding all unique elements, convolution, and universal hashing.) This result shows that limiting the communication (i.e., small m) has a significant effect on parallel computation. We show an almost matching upper bound of O((n/√(mp)) log^O(1) n). The upper bound requires the development of a few interesting techniques which can alleviate the limited communication in some
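The reconstructed bound above can be evaluated numerically to see why small shared memory hurts. A minimal sketch (function name and sample values are illustrative, and constants hidden by the Ω-notation are omitted):

```python
import math

def reversal_time_lower_bound(n, p, m):
    """Reconstructed lower bound Omega(n / sqrt(m * p)) for list reversal
    with p processors and a shared memory of size m (constants omitted)."""
    return n / math.sqrt(m * p)

# With n >> p >> m, shrinking the shared memory m pushes the bound
# well above the ideal work-optimal time n / p:
n, p = 10**9, 10**3
for m in (10**3, 10**2, 10):
    print(m, reversal_time_lower_bound(n, p, m), n / p)
```

When m = p the bound degenerates to the trivial n/p; each factor-of-100 reduction in m raises it by a factor of 10.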
Efficient Deterministic and Probabilistic Simulations of PRAMs on Linear Arrays with Reconfigurable Pipelined Bus Systems
 Journal of Supercomputing
, 2000
Abstract

Cited by 15 (11 self)
In this paper, we present deterministic and probabilistic methods for simulating PRAM computations on linear arrays with reconfigurable pipelined bus systems (LARPBS). The following results are established in this paper. (1) Each step of a p-processor PRAM with m = O(p) shared memory cells can be simulated by a p-processor LARPBS in O(log p) time, where the constant in the big-O notation is small. (2) Each step of a p-processor PRAM with m = ω(p) shared memory cells can be simulated by a p-processor LARPBS in O(log m) time. (3) Each step of a p-processor PRAM can be simulated by a p-processor LARPBS in O(log p) time with probability larger than 1 − 1/p^c for all c > 0. (4) As an interesting byproduct, we show that a p-processor LARPBS can sort p items in O(log p) time, with a small constant hidden in the big-O notation. Our results indicate that an LARPBS can simulate a PRAM very efficiently. Keywords: Concurrent read, concurrent write, deterministic simulation, linear array...
Simulation of PRAM Models on Meshes
 Nordic Journal of Computing, 2(1):51
, 1994
Abstract

Cited by 14 (9 self)
We analyze the complexity of simulating a PRAM (parallel random access machine) on a mesh-structured distributed memory machine. By utilizing suitable algorithms for randomized hashing, routing in a mesh, and sorting in a mesh, we prove that simulation of a PRAM on a √N × √N (or ∛N × ∛N × ∛N) mesh is possible with O(√N) (respectively O(∛N)) delay with high probability and a relatively small constant. Furthermore, with more sophisticated simulations further speedups are achieved; experiments show delays as low as √N + o(√N) (respectively ∛N + o(∛N)) per N PRAM processors. These simulations compare quite favorably with PRAM simulations on butterfly and hypercube. 1 Introduction The PRAM (Parallel Random Access Machine) is an abstract model of computation. It consists of N processors, each of which may have some local memory and registers, and a global shared memory of size m. A step of a PRAM is often seen to consist of...
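The delay figures in the abstract can be compared directly. A small sketch (function names are illustrative; the hypercube figure is the usual log N order-of-magnitude, not a result from this paper):

```python
import math

def mesh_delay(N, dim=2):
    """Leading-order simulation delay per PRAM step on a d-dimensional
    mesh: N**(1/d), matching the abstract's sqrt(N) and cbrt(N) figures."""
    return N ** (1.0 / dim)

def hypercube_delay(N):
    """For comparison: the customary log N-order delay on
    hypercube-like networks (not a claim from this paper)."""
    return math.log2(N)

N = 2**20
print(mesh_delay(N, 2), mesh_delay(N, 3), hypercube_delay(N))
```

For N = 2^20 processors the 2-D mesh delay is 1024 steps versus roughly 102 on a 3-D mesh and about 20 on a hypercube, which is why the per-processor constants measured in the experiments matter so much.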
Special Issue on Group Communication Systems
 In Communications of the ACM
, 1996
Abstract

Cited by 7 (4 self)
Abstract: The high latency of memory operations is a problem in both sequential and parallel computing. Multithreading is a technique which can be used to eliminate the delays caused by high latency: a processor executes other processes (threads) while one process is waiting for the completion of a memory operation. In this paper we investigate the implementation of multithreading at the processor level. As a result we outline and evaluate a Multi-Threaded VLIW processor Architecture with functional unit Chaining (MTAC), which is specially designed for PRAM-style parallelism. According to our experiments MTAC offers remarkably better performance than a basic pipelined RISC architecture, and chaining improves the exploitation of instruction-level parallelism to a level where the achieved speedup corresponds to the number of functional units in a processor.
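The latency-hiding idea behind multithreading can be sketched with a textbook-style utilization model (this is an illustrative model, not the MTAC evaluation from the paper; all names and numbers are assumptions):

```python
def utilization(threads, compute_cycles, memory_latency):
    """Fraction of time a multithreaded processor stays busy when each
    thread computes for compute_cycles and then stalls for
    memory_latency cycles waiting on a memory operation."""
    # Threads needed to cover one stall: the running thread plus
    # enough others to fill memory_latency cycles of work.
    needed = 1 + memory_latency / compute_cycles
    return min(1.0, threads / needed)

# With a 90-cycle memory latency and 10 cycles of work per access,
# about 10 threads suffice to hide the latency completely:
for t in (1, 5, 10):
    print(t, utilization(t, 10, 90))
```

A single thread keeps the processor only 10% busy under these assumptions; ten threads reach full utilization, which is the effect the paper exploits at the processor level.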
Memory Module Structures for Shared Memory Simulation
, 2000
Abstract

Cited by 5 (2 self)
The shared memory programming model on top of a physically distributed memory machine (SDMM) is a promising candidate for easy-to-program general-purpose parallel computation. There are, however, certain open technical problems which should be sufficiently solved before SDMM can meet these expectations. Among them is the low-level structure of the memory system, because most academic studies of the subject assume unrealistically ideal memory properties, ignoring completely, e.g., the speed difference between processors and memories. In this paper we propose three memory module structures based on low-level interleaving and caching for solving this speed difference problem. We evaluate these structures along with three reference solutions by determining the overall cost factor of memory references with respect to an ideal SDMM using real parallel programs. According to the evaluation, a cost of less than two is achieved with an interleaved solution if a proper amount of parallelism is available. Moreover,...
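Low-order interleaving, the basic mechanism behind the proposed module structures, can be shown in a few lines. A minimal sketch (the address-mapping scheme is the standard one, assumed here rather than taken from the paper's specific designs):

```python
def module_of(address, num_modules):
    """Low-order interleaving: consecutive addresses map to
    consecutive memory modules, spreading a reference stream
    across all banks so slow memories can serve it in parallel."""
    return address % num_modules

def offset_in_module(address, num_modules):
    """Location of the word inside its module."""
    return address // num_modules

# Eight consecutive references hit eight different modules:
print([module_of(a, 8) for a in range(8)])  # prints [0, 1, 2, 3, 4, 5, 6, 7]
```

With enough parallel references in flight, each module sees only every num_modules-th request, which is how the speed gap between processors and memories is amortized.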
Work-Optimal Simulation of PRAM Models on Meshes
 Nordic Journal of Computing, 2(1):51
, 1994
Abstract

Cited by 5 (3 self)
In this paper we consider work-optimal simulations of PRAM models on coated meshes. Coated meshes consist of a mesh-connected routing machinery with processors on the surface of the mesh. We prove that coated meshes with 2-dimensional or 3-dimensional routing machinery can work-optimally simulate the EREW, CREW, and CRCW PRAM models. The general idea behind this simulation is to use Valiant's XPRAM approach and to ignore the work complexity of the simple nodes of the routing machinery. 1 Introduction There is a wide variety of approaches to parallelism in general [40], and even to general-purpose parallelism [39], reflecting the prevailing uncertainty about the correct approach. One model aiming at general-purpose parallelism is the PRAM (Parallel Random Access Machine) model, which is a natural generalization of the classical RAM model. It consists of N processors, each of which may have some local memory and registers, and a global shared memory of size m. A step of a PRAM is often seen to con...
Systolic Combining Switch Designs
, 1994
Abstract

Cited by 4 (0 self)
I have been fortunate to carry out the research for this dissertation in the stimulating and supportive environment of the Ultracomputer Research Laboratory. I would like to thank my advisor, Allan Gottlieb, Ultra's director, for his role in creating that environment as well as for his careful reading of several drafts of my dissertation. As part of my work for Ultra, I have enjoyed fruitful research collaborations with Richard Kenner, my long-term partner in VLSI design, Ora Percus, my mentor in stochastic analysis and queueing theory, Yuesheng Liu, who helped me formalize my thinking about switch types as well as simulating network behavior, Jan Edler, who was always willing to add another feature to his network simulator, and Ron Bianchini, who makes hardware work. In the computer science department as a whole, I would like to thank Alan Siegel, for reading my dissertation and making helpful comments, Ernie Davis and Richard Wallace, for serving on my committee, Elaine Weyuker, whose literature review seminar was very helpful in building both my confidence and competence to do computer science research, and Anina KarmenMeade, who often helped me navigate NYU's bureaucratic labyrinths. Most of all, my gratitude goes to my husband, Tom Du, for his faith and support at all times, and to my children, Rachael Evans, Keelan Evans and Timothy Du. No matter how the research is going, my wonderful family always fills my life with joy.
Experimental Results for Four Work-Optimal PRAM Simulation Algorithms on Coated Meshes
, 1994
Abstract

Cited by 4 (4 self)
In this paper we consider the effect of overloading in four work-optimal PRAM simulation algorithms on coated meshes with P real processors. A coated mesh consists of a mesh-connected routing machinery and processor-memory pairs, which form a coat on the routing machinery. Previously, work-optimal PRAM simulations which ignore the effect of overloading have been presented for coated meshes, but their cost is relatively high (around 100). The algorithms we study here are based on greedy routing, sorting, an improved virtual levelled network technique, and a combining queues method. Our results show that overloading alone can be used to improve the simulation cost of all PRAM models on coated meshes to circa 10 (and even fewer) routing steps per P simulated PRAM processors. 1 Introduction In [13] three algorithms for simulating PRAM models on coated meshes were presented (see also [15, 16]). The EREW PRAM simulation algorithm was based on a modification of the basic greedy routing algorithm...
Integrating Synchronous and Asynchronous Paradigms: The Fork95 Parallel Programming Language
 Proc. MPPM95 Int. Conf. on Massively Parallel Programming Models
, 1995
"... ..."
The Fork95 Parallel Programming Language: Design, Implementation, Application
, 1997
Abstract

Cited by 2 (2 self)
Fork95 is an imperative parallel programming language intended to express algorithms for synchronous shared memory machines (PRAMs). It is based on ANSI C and offers additional constructs to hierarchically divide processor groups into subgroups and to manage shared and private address subspaces. Fork95 makes the assembly-level synchronicity of the underlying hardware available to the programmer at the language level. Nevertheless, it supports locally asynchronous computation where desired by the programmer. We present a one-pass compiler, fcc, which compiles Fork95 and C programs to the SB-PRAM machine. The SB-PRAM is a lock-step synchronous, massively parallel multiprocessor currently being built at Saarbrücken University, with a physically shared memory and uniform memory access time. We examine three important types of parallel computation frequently used for the parallel solution of real-world problems. While farming and parallel divide-and-conquer are directly supported by Fork95 language constructs, pipelining can be easily expressed using existing language features; an additional language construct for pipelining is not required.