Results 1 - 10 of 36
LogP: Towards a Realistic Model of Parallel Computation
, 1993
Abstract

Cited by 497 (14 self)
A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
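As a rough illustration of how the four LogP parameters combine, here is a minimal Python sketch of the cost of a pipelined message sequence; the parameter values are invented for illustration, not measurements from the paper, and it assumes the gap g dominates the send overhead o:

```python
# Sketch: estimating the cost of k back-to-back single-word messages
# between two processors under LogP. Assumes g >= o, so the sender can
# inject a new message every g cycles; the last message then needs
# o (send overhead) + L (latency) + o (receive overhead) to arrive.

def logp_k_messages(k: int, L: int, o: int, g: int) -> int:
    """Time for k pipelined point-to-point messages under LogP."""
    return (k - 1) * g + o + L + o

# Illustrative (made-up) parameter values, in cycles:
L, o, g = 6, 2, 4
print(logp_k_messages(1, L, o, g))   # a single message costs 2*o + L = 10
print(logp_k_messages(8, L, o, g))   # pipelining amortizes the latency
```

Note how the latency L is paid only once for the whole pipeline, which is the kind of bottleneck structure the model is designed to expose.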
LogGP: Incorporating Long Messages into the LogP Model - One step closer towards a realistic model for parallel computation
, 1995
Abstract

Cited by 236 (1 self)
We present a new model of parallel computation, the LogGP model, and use it to analyze a number of algorithms, most notably, the single node scatter (one-to-all personalized broadcast). The LogGP model is an extension of the LogP model for parallel computation [CKP+93], which abstracts the communication of fixed-sized short messages through the use of four parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). As evidenced by experimental data, the LogP model can accurately predict communication performance when only short messages are sent (as on the CM-5) [CKP+93, CDMS94]. However, many existing parallel machines have special support for long messages and achieve a much higher bandwidth for long messages compared to short messages (e.g., IBM SP-2, Paragon, Meiko CS-2, nCUBE/2). We extend the basic LogP model with a linear model for long messages. This combination, which we call the LogGP model of parallel computation, has o...
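The linear long-message extension can be sketched as follows: LogGP adds a per-byte gap G, so one k-byte message costs o + (k-1)G + L + o, whereas plain LogP must break the transfer into short messages. The parameter values below are invented for illustration only:

```python
# Sketch comparing LogP and LogGP estimates for moving k bytes.
# All parameter values are illustrative, not measured on any machine.

def logp_time(k_bytes: int, w: int, L: int, o: int, g: int) -> int:
    """k_bytes sent as ceil(k_bytes / w) w-byte short messages under LogP."""
    n = -(-k_bytes // w)              # ceiling division
    return (n - 1) * g + 2 * o + L

def loggp_time(k_bytes: int, L: int, o: int, G: int) -> int:
    """One k-byte long message under LogGP: o + (k-1)*G + L + o."""
    return 2 * o + (k_bytes - 1) * G + L

# Made-up values: 4-byte short messages, gap g = 8 per message,
# long-message gap G = 1 per byte.
L, o, g, G, w = 6, 2, 8, 1, 4
print(logp_time(64, w, L, o, g))    # -> 130: pays the per-message gap 15 times
print(loggp_time(64, L, o, G))      # -> 73: the long-message path is cheaper
```

The crossover between the two estimates is exactly the kind of effect that the machines named in the abstract exploit with their long-message support.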
Optimal Broadcast and Summation in the LogP Model
 In Proc. 5th ACM Symp. on Parallel Algorithms and Architectures
, 1993
Abstract

Cited by 54 (2 self)
In many distributed-memory parallel computers the only built-in communication primitive is point-to-point message transmission, and more powerful operations such as broadcast and synchronization must be realized using this primitive. Within the LogP model of parallel computation we present algorithms that yield optimal communication schedules for several broadcast and synchronization operations. Most of our algorithms are the absolutely best possible in that not even the constant factors can be improved upon. For one particular broadcast problem, called continuous broadcast, the optimality of our algorithm is not yet completely proven, although proofs have been achieved for a certain range of parameters. We also devise an optimal algorithm for summing or, more generally, applying a non-commutative associative binary operator to a set of operands.
1 Introduction
Most models of parallel computation reflect the communication bottlenecks of real parallel machines inadequately. The PRAM [11], ...
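The flavor of LogP broadcast scheduling can be conveyed by the greedy rule in which every informed processor keeps forwarding the value to a new processor every g cycles, and each receiver joins in as soon as its receive overhead completes. The simulation below is a sketch of that rule with invented parameter values, not the paper's full schedule construction:

```python
# Sketch of greedy single-datum broadcast under LogP: pop the earliest
# available sender, deliver to one new processor (cost o + L + o), and
# make both the sender (after gap g) and the receiver available again.

import heapq

def broadcast_time(P: int, L: int, o: int, g: int) -> int:
    """Cycles until all P processors know the value under greedy broadcast."""
    ready = [0]                      # times at which senders become available
    informed = 1                     # the root already knows the value
    finish = 0
    while informed < P:
        t = heapq.heappop(ready)     # earliest available sender
        arrival = t + o + L + o      # send overhead + latency + receive overhead
        informed += 1
        finish = max(finish, arrival)
        heapq.heappush(ready, t + g)       # sender free again after the gap
        heapq.heappush(ready, arrival)     # receiver becomes a sender
    return finish

print(broadcast_time(8, L=6, o=2, g=4))
```

The schedule is not a fixed-degree tree: how many children a processor sends to depends on the relative sizes of L, o, and g, which is why optimal LogP broadcast differs from, say, a binomial-tree broadcast.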
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
, 1999
Abstract

Cited by 42 (11 self)
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism a convenient style, and the shared-memory abstraction an easy-to-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP and on a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios
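One simplified reading of QSM-style accounting is that a phase pays for the slowest processor's local work, for its bandwidth-limited shared-memory traffic (scaled by a gap parameter g), and for the worst queuing contention at any single location. The sketch below renders that reading in Python; it is an illustration of the idea, not the paper's formal definition:

```python
# Toy per-phase cost rule in the spirit of the Queuing Shared-Memory
# (QSM) model: max of local computation, g-scaled per-processor memory
# traffic, and the longest queue at any one shared location.
# This is a simplified reading of the abstract, not the formal model.

from collections import Counter

def qsm_phase_cost(local_ops, accesses, g):
    """local_ops: op count per processor;
    accesses: list, per processor, of shared locations touched this phase."""
    t_comp = max(local_ops)                          # slowest local work
    t_band = g * max(len(a) for a in accesses)       # per-processor traffic
    contention = Counter(loc for a in accesses for loc in a)
    kappa = max(contention.values()) if contention else 0   # queue length
    return max(t_comp, t_band, kappa)

# Two processors; both touch location "x", so its queue has length 2.
print(qsm_phase_cost([5, 3], [["x", "y"], ["x"]], g=2))
```

The point of the contention term is that queuing at a hot location can dominate cost even when total traffic is low, which is what distinguishes QSM from an unqueued PRAM-style abstraction.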
A Quantitative Comparison of Parallel Computation Models
, 1996
Abstract

Cited by 32 (5 self)
This paper experimentally validates performance-related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 with 64 processors). Our work consists of three parts. First, there is an evaluation part in which we investigate whether the models correctly predict the execution time of an algorithm implementation. Unlike previous work, which mostly demonstrated a close match between the measured and predicted running times, this paper shows that there are situations in which the models do not precisely predict the actual execution time of an algorithm implementation. Second, there is a comparison part in which the models are contrasted with each other in order to determine which model induces the fastest algorithms. Finally, there is an efficiency validation part in which the performance of the model-derived algorithms is compared with the performance of highly optimized library routines to show the effectiveness ...
Parallel Sorting With Limited Bandwidth
 in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1995
Abstract

Cited by 26 (5 self)
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the tradeoff between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. We prove a lower bound of Ω((n log m)/m) on the time to sort n numbers in an exclusive-read variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that bot...
Tradeoffs Between Communication Throughput and Parallel Time
, 1994
Abstract

Cited by 18 (1 self)
We study the effect of limited communication throughput on parallel computation in a setting where the number of processors is much smaller than the length of the input. Our model has p processors that communicate through a shared memory of size m. The input has size n, and can be read directly by all the processors. We will be primarily interested in studying cases where n ≫ p ≫ m. As a test case we study the list reversal problem. For this problem we prove a time lower bound of Ω(n/√(mp)). (A similar lower bound holds also for the problems of sorting, finding all unique elements, convolution, and universal hashing.) This result shows that limiting the communication (i.e., small m) has a significant effect on parallel computation. We show an almost matching upper bound of O((n/√(mp)) log^{O(1)} n). The upper bound requires the development of a few interesting techniques which can alleviate the limited communication in some
CICO: A Practical Shared-Memory Programming Performance Model
 Workshop on Portability and Performance for Parallel Processing
, 1993
Abstract

Cited by 15 (1 self)
A programming performance model provides a programmer with feedback on the cost of program operations and is a necessary basis for writing efficient programs. Many shared-memory performance models do not accurately capture the cost of interprocessor communication caused by non-local memory references, particularly in computers with caches. This paper describes a simple and practical programming performance model, called check-in, check-out (CICO), for cache-coherent, shared-memory parallel computers. CICO consists of two components. The first is a collection of annotations that a programmer adds to a program to elucidate the communication arising from shared-memory references. The second is a model that calculates the communication cost of these annotations. An annotation's cost models the cost of the memory references that it summarizes and serves as a metric to compare alternative implementations. Several examples demonstrate that CICO accurately predicts cache misses and identifies changes that improve program performance.
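The annotation idea can be rendered as a toy cost calculator: bracketing shared-data accesses with check-out/check-in markers and charging a communication cost at each first check-out. The interface names and costs below are illustrative stand-ins, not CICO's actual annotations or cost model:

```python
# Toy illustration of the check-in/check-out idea: the first check-out of
# a block is charged as a coherence miss, repeated check-outs as hits,
# and checking a block in makes the next check-out a miss again.
# All names and cost constants here are hypothetical, for illustration.

MISS_COST = 100    # assumed cycles for a coherence miss
HIT_COST = 1       # assumed cycles for a cached access

class ToyCicoCost:
    def __init__(self):
        self.cycles = 0
        self.cached = set()

    def check_out(self, block):
        self.cycles += HIT_COST if block in self.cached else MISS_COST
        self.cached.add(block)

    def check_in(self, block):
        # Returning the block models handing it to another processor,
        # so the next check-out is a miss again.
        self.cached.discard(block)

model = ToyCicoCost()
model.check_out("A"); model.check_out("A")   # miss, then hit
model.check_in("A")
model.check_out("A")                          # miss again
print(model.cycles)                           # -> 201
```

Even this toy version shows how an annotation-level cost lets two candidate access patterns be compared before running them on hardware.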
Clumps: A Candidate Model Of Efficient, General Purpose Parallel Computation
, 1994
Abstract

Cited by 10 (6 self)
A new model of parallel computation is proposed, CLUMPS (Campbell's Lenient, Unified Model of Parallel Systems). It is composed of an abstract machine with an associated cost model, and aims to be more portable, more reflective of costs, more expressive, and more encouraging of efficient algorithm implementations than other existing models. It is shown that each basic parallel architecture class can congruently perform each other's computations, but that the congruent simulation of each other's communication is not generally possible (where a simulation is congruent if the simulation costs on the target architecture are asymptotically equivalent to the implementation costs on the native architecture). This is reflected in the CLUMPS abstract machine through its flexibility in terms of program control and memory access. The congruence requirement is relaxed so that, though strict congruence may not be achieved according to the above definition, communication costs are reflectively accounted ...
Efficient parallel algorithms for closest point problems
, 1994
Abstract

Cited by 10 (1 self)
This dissertation develops and studies fast algorithms for solving closest point problems. Algorithms for such problems have applications in many areas including statistical classification, crystallography, data compression, and finite element analysis. In addition to a comprehensive empirical study of known sequential methods, I introduce new parallel algorithms for these problems that are both efficient and practical. I present a simple and flexible programming model for designing and analyzing parallel algorithms. Also, I describe fast parallel algorithms for nearest-neighbor searching and constructing Voronoi diagrams. Finally, I demonstrate that my algorithms obtain good performance on a wide variety of machine architectures. The key algorithmic ideas that I examine are exploiting spatial locality and random sampling. Spatial decomposition allows many concurrent threads to work independently of one another in local areas of a shared data structure. Random sampling provides a simple way to adaptively decompose irregular problems and to balance workload among many threads. Used together, these techniques result in effective algorithms for a wide range of geometric problems. The key