Horizons of Parallel Computation
Journal of Parallel and Distributed Computing, 1993
Cited by 40 (3 self)
This paper considers the ultimate impact of fundamental physical limitations (notably, speed of light and device size) on parallel computing machines. Although we fully expect an innovative and very gradual evolution to the limiting situation, we take here the provocative view of exploring the consequences of the accomplished attainment of the physical bounds. The main result is that scalability holds only for neighborly interconnections, such as the square mesh, of bounded-size synchronous modules, presumably of the area-universal type. We also discuss the ultimate infeasibility of latency-hiding, the violation of intuitive maximal speedups, and the emerging novel processor-time tradeoffs.
Coscheduling Based on Run-Time Identification of Activity Working Sets
International Journal of Parallel Programming
Cited by 28 (3 self)
This paper introduces a method for run-time identification of sets of interacting activities ("working sets") with the purpose of coscheduling them, i.e. scheduling them so that all the activities in the set execute simultaneously on distinct processors. The identification is done by monitoring access rates to shared communication objects: activities that access the same objects at a high rate thereby interact frequently, and therefore would benefit from coscheduling. Simulation results show that coscheduling with our run-time identification scheme can give better performance than uncoordinated scheduling based on a single global activity queue. The finer-grained the interactions among the activities in a working set, the better the performance differential. Moreover, coscheduling based on automatic run-time identification achieves about the same performance as coscheduling based on manual identification of working sets by the programmer. Keywords: coscheduling, gang scheduling, online ...
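The identification idea in this abstract (group activities that access the same shared objects at a high rate) can be illustrated with a minimal sketch. This is not the paper's algorithm: the input format, the fixed `threshold`, and the function name `identify_working_sets` are all assumptions made for illustration, and the grouping uses a plain union-find.

```python
from collections import defaultdict


def identify_working_sets(access_rates, threshold):
    """Group activities that access a common object at a rate >= threshold.

    access_rates: dict mapping (activity, shared_object) -> accesses per
    unit time (hypothetical input format). Returns a list of sets, each
    holding two or more activities that would be coscheduled together.
    """
    parent = {}

    def find(a):
        # Union-find lookup with path halving.
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect, per shared object, the activities that use it heavily.
    heavy_users = defaultdict(list)
    for (activity, obj), rate in access_rates.items():
        find(activity)  # register every activity, even lightly interacting ones
        if rate >= threshold:
            heavy_users[obj].append(activity)

    # Activities sharing a heavily used object belong to one working set.
    for users in heavy_users.values():
        for other in users[1:]:
            union(users[0], other)

    groups = defaultdict(set)
    for activity in list(parent):
        groups[find(activity)].add(activity)
    return [group for group in groups.values() if len(group) > 1]
```

With a threshold of 10, activities `a` and `b` that both hit object `q1` at rates 50 and 40 end up in one set, while an activity touching `q1` only once per time unit is left out.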
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
Proc. 5th ACM-SIAM Symp. on Discrete Algorithms, 1997
Cited by 24 (10 self)
This paper introduces the queue-read queue-write (QRQW) parallel random access machine (PRAM) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models: the CRCW model does not adequately penalize algorithms with high contention to shared-memory locations, while the EREW model is too strict in its insistence on zero contention at each step. The QRQW PRAM is strictly more powerful than the EREW PRAM. This paper shows a separation of log n between the two models, and presents faster and more efficient QRQW algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the QRQW PRAM with only logarithmic slowdown on Valiant’s BSP model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the best-known emulation result for the EREW PRAM, and considerably improves upon the best-known efficient emulation for the CRCW PRAM on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
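The contrast between the three models in this abstract can be made concrete with a small cost function. The sketch below is a simplification based only on the abstract: it assumes the QRQW cost of a step equals the largest number of readers, or of writers, at any single location (and at least 1), and it ignores local computation. The names `max_contention` and `step_cost` are illustrative, not taken from the paper.

```python
from collections import Counter


def max_contention(reads, writes):
    """Largest number of processors reading, or writing, one location.

    reads / writes: lists of memory addresses, one entry per processor
    access issued in the current step.
    """
    read_counts = Counter(reads)
    write_counts = Counter(writes)
    return max(
        max(read_counts.values(), default=0),
        max(write_counts.values(), default=0),
    )


def step_cost(reads, writes, model):
    """Charge for one parallel step under a simplified model of each PRAM."""
    contention = max_contention(reads, writes)
    if model == "crcw":
        # Concurrent access is free: contention is never penalized.
        return 1
    if model == "qrqw":
        # Accesses to a location queue up: cost grows with contention.
        return max(1, contention)
    if model == "erew":
        # Any concurrent access to a location is simply not allowed.
        if contention > 1:
            raise ValueError("EREW forbids concurrent access to a location")
        return 1
    raise ValueError(f"unknown model: {model}")
```

A step in which three processors read address 0 costs 3 under this QRQW sketch, 1 under CRCW, and is rejected under EREW.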
Supporting Sets of Arbitrary Connections on iWarp Through Communication Context Switches
In Proc. SPAA, 1993
Cited by 15 (6 self)
In this paper we introduce the ConSet communication model for distributed memory parallel computers. The communication needs of an application program can be satisfied by some arbitrary set of connections which are partitioned into discrete phases. A communication context switch is used to select the active phase. We present an implementation of the ConSet model on the iWarp and describe its performance characteristics, contrasting it to a message-passing implementation on the same machine. Our implementation demonstrates how one existing parallel computer can function as a “reconfigurable network” without needing a new processor interconnect technology. The ConSet model works best when communication patterns can be optimized at compile time. We examine the interactions of the target architecture with the algorithmic problems encountered designing a communication compiler to effectively partition, route, and schedule connections. We built a prototype communication compiler for our iWarp implementation, and are using it to generate iWarp code. Looking at basic communication patterns as well as patterns generated by an iterative finite element PDE solver, we compare ConSet’s performance (using the compiler’s schedules) to that of message passing. Our experiments suggest that ConSet communication offers a performance advantage over message passing in applications where the communication pattern is known at compile time.
Computing Global Combine Operations in the Multi-Port Postal Model
1996
Cited by 14 (1 self)
Consider a message-passing system of n processors, in which each processor holds one piece of data initially. The goal is to compute an associative and commutative reduction function on the n distributed pieces of data and to make the result known to all the n processors. This operation is frequently used in many message-passing systems and is typically referred to as global combine, census computation, or gossiping. This paper explores the problem of global combine in the multi-port postal model for message-passing systems. This model is characterized by three parameters: n, the number of processors; k, the number of ports per processor; and λ, the communication latency. In this model, in every round r, each processor can send k distinct messages to k other processors, and it can receive k messages that were sent out from k other processors λ − 1 rounds earlier. This paper provides an optimal algorithm for the global combine problem that requires the least number of comm...
Implementation of Parallel Graph Algorithms on a Massively Parallel SIMD Computer with Virtual Processing
1995
Cited by 12 (3 self)
We describe our implementation of several PRAM graph algorithms on the massively parallel computer MasPar MP-1 with 16,384 processors. Our implementation incorporated virtual processing and we present extensive test data. In a previous project [13], we reported the implementation of a set of parallel graph algorithms with the constraint that the maximum input size was restricted to be no more than the physical number of processors on the MasPar. The MasPar language MPL that we used for our code does not support virtual processing. In this paper, we describe a method of simulating virtual processors on the MasPar. We recoded and fine-tuned our earlier parallel graph algorithms to incorporate the usage of virtual processors. Under the current implementation scheme, there is no limit on the number of virtual processors that one can use in the program as long as there is enough main memory to store all the data required during the computation. We also give two general optimization techniq...
Parallel I/O Systems and Interfaces for Parallel Computers
1995
Cited by 11 (1 self)
Continued improvements in processor performance have exposed I/O subsystems as a significant bottleneck, which prevents applications from achieving full system utilization [33, 54]. This problem is exacerbated in massively parallel processors (MPPs), where multiple processors are used together. As a result, I/O subsystems have become the focus of much research, leading to the design of parallel I/O hardware and matching system software. The requirement driving the work on I/O subsystems is the desire to achieve a balanced system [8]. The degree to which a system is balanced is typically expressed by the F/b ratio, which is defined as the ratio of the rate of executing floating point operations (F) to the rate of performing I/O, in bits per second (b). A widely accepted rule of thumb, attributed to Amdahl, calls for F/b ≈ 1. While this was originally expressed in instructions rather than floating po
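The F/b balance ratio defined in this abstract is simple arithmetic, sketched below. The function name `balance_ratio` and the sample machine figures are hypothetical and chosen only to show how the ratio is read.

```python
def balance_ratio(flops_per_second, io_bits_per_second):
    """Return F/b: floating point operations per second over I/O bits per second."""
    if io_bits_per_second <= 0:
        raise ValueError("I/O rate must be positive")
    return flops_per_second / io_bits_per_second


# Hypothetical machine: 1 GFLOPS of compute against 100 Mbit/s of I/O.
ratio = balance_ratio(1_000_000_000, 100_000_000)
print(f"F/b = {ratio:.1f}")
```

A ratio far above 1, as in this made-up example, indicates a machine whose I/O subsystem lags well behind its compute rate by the rule of thumb quoted above.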
The Race Network Architecture
Proceedings of the 9th International Parallel Processing Symposium (IPPS ’95), sponsor: IEEE Computer Society Technical Committee on Parallel Processing, 1995
Gossiping in Vertex-Disjoint Paths Mode in d-Dimensional Grids and Planar Graphs (Extended Abstract)
Information and Computation, 1993
Cited by 8 (2 self)
Juraj Hromkovic, Ralf Klasing, Elena A. Stohr, Hubert Wagener. Department of Mathematics and Computer Science, University of Paderborn, 33095 Paderborn, Germany. The communication modes (one-way and two-way mode) used for sending messages to processors of interconnection networks via vertex-disjoint paths in one communication step are investigated. The complexity of communication algorithms is measured by the number of communication steps (rounds). Here, the complexity of gossiping in grids and in planar graphs is investigated. The main results are the following: 1. Effective one-way and two-way gossip algorithms for d-dimensional grids, d ≥ 2, are designed. 2. The lower bound 2 log_2 n − log_2 k − log_2 log_2 n − 2 is established on the number of rounds of every two-way gossip algorithm working on any graph of n nodes and vertex bisection k. This proves that the designed two-way gossip algorithms on d-dimensional grids, d ≥ 3, are almost optimal, and it al...
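The lower bound stated in result 2 can be evaluated numerically for concrete graph sizes. The sketch below only computes the expression 2 log_2 n − log_2 k − log_2 log_2 n − 2 from the abstract; the function names and the rounding up to whole rounds are illustrative assumptions.

```python
import math


def two_way_gossip_lower_bound(n, k):
    """Evaluate 2*log2(n) - log2(k) - log2(log2(n)) - 2.

    n: number of nodes (must exceed 1 so that log2(log2(n)) is defined).
    k: vertex bisection of the graph (must be positive).
    """
    if n <= 1 or k <= 0:
        raise ValueError("need n > 1 and k > 0")
    log_n = math.log2(n)
    return 2 * log_n - math.log2(k) - math.log2(log_n) - 2


def min_rounds(n, k):
    """Round the bound up to a whole, non-negative number of rounds."""
    return max(0, math.ceil(two_way_gossip_lower_bound(n, k)))
```

For example, a graph with n = 256 nodes and vertex bisection k = 16 gives 16 − 4 − 3 − 2 = 7 rounds.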