BSPlib: The BSP Programming Library
, 1998
Abstract

Cited by 82 (6 self)
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access using put or get operations, and bulk synchronous message passing. Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics.
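As an illustration of the put-based direct remote memory access style this abstract describes, a minimal BSPlib sketch might look as follows. This is our own neighbour-exchange example, not one from the paper, and it assumes a working BSPlib installation providing the standard `bsp.h` header:

```c
#include <stdio.h>
#include "bsp.h"   /* BSPlib header; requires a BSPlib installation */

int main(void) {
    bsp_begin(bsp_nprocs());            /* start the SPMD section on all processes */
    int p = bsp_nprocs(), pid = bsp_pid();

    int value = pid, received = -1;
    bsp_push_reg(&received, sizeof(int)); /* register memory for remote access */
    bsp_sync();                           /* registration takes effect after a superstep */

    /* direct remote memory access: put our value into the next processor's buffer */
    bsp_put((pid + 1) % p, &value, &received, 0, sizeof(int));
    bsp_sync();                           /* communication completes at the barrier */

    printf("process %d received %d\n", pid, received);
    bsp_pop_reg(&received);
    bsp_end();
    return 0;
}
```

Note how the superstep structure is explicit: all `bsp_put` traffic issued before a `bsp_sync` is guaranteed delivered only after that barrier completes.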
Questions And Answers About BSP
, 1996
Abstract

Cited by 52 (4 self)
Bulk Synchronous Parallelism (BSP) is a parallel programming model that abstracts from low-level program structures in favour of supersteps. A superstep consists of a set of independent local computations, followed by a global communication phase and a barrier synchronisation. Structuring programs in this way enables their costs to be accurately determined from a few simple architectural parameters, namely the permeability of the communication network to uniformly-random traffic and the time to synchronise. Although permutation routing and barrier synchronisations are widely regarded as inherently expensive, this is not the case. As a result, the structure imposed by BSP comes for free in performance terms, while bringing considerable benefits from an application-building perspective. This paper answers the most common questions we are asked about BSP and justifies its claim to be a major step forward in parallel programming.
1 Why is another model needed? In the 1980s a large number ...
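The cost claim in this abstract can be stated as the standard BSP cost formula (our notation, which may differ from the paper's): if in a superstep every processor performs at most $w$ local operations and the communication phase routes an $h$-relation (no processor sends or receives more than $h$ words), then with $g$ the network's permeability parameter and $l$ the synchronisation cost,

```latex
T_{\text{superstep}} = w + g \cdot h + l,
\qquad
T_{\text{program}} = \sum_{i=1}^{S} \left( w_i + g \cdot h_i + l \right)
```

so the running time of an $S$-superstep program follows from the per-superstep pairs $(w_i, h_i)$ and the two architectural parameters $g$ and $l$ alone.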
The Theory, Practice, And A Tool For BSP Performance Prediction Applied To A CFD Application
 In Europar'96, volume 1124 of LNCS
, 1996
Abstract

Cited by 33 (7 self)
The Bulk Synchronous Parallel (BSP) model provides a theoretical framework to accurately predict the execution time of parallel programs. In this paper we describe a BSP programming library that has been developed, and contrast two approaches to analysing performance: (1) a pencil-and-paper method with a theoretical cost model; (2) a profiling tool that analyses trace information generated during program execution. These approaches are evaluated on an industrial application code that solves fluid dynamics equations around a complex aircraft geometry on an IBM SP2 and SGI PowerChallenge. We show how the tool can be used to explore the communication patterns of the CFD code and accurately predict the performance of the application on any parallel machine.
1 Introduction The efficient implementation of complex algorithms onto parallel machines is an arduous task. Furthermore, the resulting performance is often only known once this task has been completed. This is unsatisfactory consideri...
BSP vs LogP
, 1996
Abstract

Cited by 30 (4 self)
A quantitative comparison of the BSP and LogP models of parallel computation is developed. We concentrate on a variant of LogP that disallows the so-called stalling behavior, although issues surrounding the stalling phenomenon are also explored. Very efficient cross simulations between the two models are derived, showing their substantial equivalence for algorithmic design guided by asymptotic analysis. It is also shown that the two models can be implemented with similar performance on most point-to-point networks. In conclusion, within the limits of our analysis, which is mainly of an asymptotic nature, BSP and (stall-free) LogP can be viewed as closely related variants within the bandwidth-latency framework for modeling parallel computation. BSP seems somewhat preferable due to its greater simplicity and portability, and slightly greater power. LogP lends itself more naturally to multi-user mode.
A Proposal for the BSP Worldwide Standard Library (preliminary version)
Abstract

Cited by 26 (3 self)
This memory area is regarded as unregistered. 6. The explicit registration mechanism removes possible implicit assumptions about the compilation of static data as used by the Cray SHMEM and Oxford BSP libraries. It should be noted that static data structures will always have the simple registration information:
The Paderborn University BSP (PUB) Library: Design, Implementation and Performance
 In Proc. of 13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP)
, 1999
Abstract

Cited by 26 (5 self)
The Paderborn University BSP (PUB) library is a parallel C library based on the BSP model. The basic library supports buffered and unbuffered asynchronous communication between any pair of processors, and a mechanism for synchronizing the processors in a barrier style. In addition, it provides routines for collective communication on arbitrary subsets of processors, partition operations, and a zero-cost synchronization mechanism. Furthermore, some techniques used in its implementation deviate significantly from the techniques used in other BSP libraries.
1. Introduction Most message-passing libraries, like PVM [6] and MPI [12], are based on pairwise sends and receives. This means that for each send operation, a matching receive has to be issued on the destination processor. This approach, however, is very error-prone because deadlocks can easily be introduced if a message is never accepted. Furthermore, it is difficult to determine the correctness and complexity of programs implemented ...
BSP-Like External-Memory Computation
 In Proc. 3rd Italian Conference on Algorithms and Complexity
Abstract

Cited by 21 (0 self)
In this paper we present a paradigm for solving external-memory problems, and illustrate it by algorithms for matrix multiplication, sorting, list ranking, transitive closure and FFT. Our paradigm is based on the use of BSP algorithms. The correspondence is almost perfect, and especially the notion of x-optimality carries over to algorithms designed according to our paradigm. The advantages of the approach are similar to the advantages of BSP algorithms for parallel computing: scalability, portability, predictability. The performance measure here is the total work, not only the number of I/O operations as in previous approaches. The predicted performances are therefore more useful for practical applications.
Optimization Rules for Programming with Collective Operations
 IPPS/SPDP'99. 13th Int. Parallel Processing Symp. & 10th Symp. on Parallel and Distributed Processing
, 1999
Abstract

Cited by 21 (6 self)
We study how several collective operations like broadcast, reduction, scan, etc. can be composed efficiently in complex parallel programs. Our specific contributions are: (1) a formal framework for reasoning about collective operations; (2) a set of optimization rules which save communications by fusing several collective operations into one; (3) performance estimates, which guide the application of optimization rules depending on the machine characteristics; (4) a simple case study with the first results of machine experiments.
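One concrete illustration of a communication-saving fusion rule, in the spirit of contribution (2) though not necessarily a rule from the paper itself: a reduction whose result is immediately broadcast to all processors can be fused into a single all-reduce, replacing two communication phases by one:

```latex
\mathit{broadcast} \circ \mathit{reduce}(\oplus) \;=\; \mathit{allreduce}(\oplus)
```

Whether such a fusion actually pays off on a given machine depends on the relative costs of the collectives involved, which is exactly what the performance estimates of contribution (3) are meant to decide.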
Towards Parallel Programming by Transformation: The FAN Skeleton Framework
, 2001
Abstract

Cited by 20 (10 self)
A Functional Abstract Notation (FAN) is proposed for the specification and design of parallel algorithms by means of skeletons: high-level patterns with parallel semantics. The main weakness of the current programming systems based on skeletons is that the user is still responsible for finding the most appropriate skeleton composition for a given application and a given parallel architecture. We describe a transformational framework for the development of skeletal programs which is aimed at filling this gap. The framework makes use of transformation rules which are semantic equivalences among skeleton compositions. For a given problem, an initial, possibly inefficient skeleton specification is refined by applying a sequence of transformations. Transformations are guided by a set of performance prediction models which forecast the behavior of each skeleton and the performance benefits of different rules. The design process is supported by a graphical tool which locates applicable transformations and provides performance estimates, thereby helping the programmer in navigating through the program refinement space. We give an overview of the FAN framework and exemplify its use with performance-directed program derivations for simple case studies. Our experience can be viewed as a first feasibility study of methods and tools for transformational, performance-directed parallel programming using skeletons.
Ultimate Parallel List Ranking?
 Journal of Parallel and Distributed Computing
, 2000
Abstract

Cited by 19 (4 self)
Two improved list-ranking algorithms are presented. The "peeling-off" algorithm leads to an optimal PRAM algorithm, but was designed with application on a real parallel machine in mind. It is simpler than earlier algorithms, and in a range of problem sizes where previously several algorithms were required for the best performance, this single algorithm now suffices. If the problem size is much larger than the number of available processors, then the "sparse-ruling-sets" algorithm is even better. In previous versions this algorithm had very restricted practical application because of the large number of communication rounds it performed. The main weakness of this algorithm is overcome by adding two new ideas, each of which reduces the number of communication rounds by a factor of two.
1 Introduction A list is a basic data structure: it consists of nodes which are linked together, so that every node has precisely one predecessor and one successor, except for the initial n...