Results 1–10 of 11
Optimization Rules for Programming with Collective Operations
In IPPS/SPDP'99: 13th Int. Parallel Processing Symp. & 10th Symp. on Parallel and Distributed Processing, 1999
Abstract

Cited by 24 (7 self)
We study how several collective operations like broadcast, reduction, scan, etc. can be composed efficiently in complex parallel programs. Our specific contributions are: (1) a formal framework for reasoning about collective operations; (2) a set of optimization rules which save communications by fusing several collective operations into one; (3) performance estimates, which guide the application of optimization rules depending on the machine characteristics; (4) a simple case study with the first results of machine experiments.
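The collective operations this abstract reasons about have simple sequential specifications against which such fusion rules can be checked. The following is a minimal sketch of those specifications, not the paper's formal framework; the function names and the inclusive-scan convention are illustrative assumptions.

```python
# Sequential reference models of the collective operations discussed above.
# Hypothetical names; the paper's formal framework is not reproduced here.

def broadcast(value, p):
    """Every one of p processors receives the same value."""
    return [value] * p

def reduce_op(op, xs):
    """Combine all elements with an associative operator op."""
    acc = xs[0]
    for x in xs[1:]:
        acc = op(acc, x)
    return acc

def scan(op, xs):
    """Inclusive prefix combination: position i holds op over xs[0..i]."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(op(out[-1], x))
    return out
```

For example, `scan(lambda a, b: a + b, [1, 2, 3, 4])` yields `[1, 3, 6, 10]`, and reducing the same list yields `10`; fusion rules relate compositions of these specifications.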
(De)Composition Rules for Parallel Scan and Reduction
In Proc. 3rd Int. Working Conf. on Massively Parallel Programming Models (MPPM'97), 1998
Abstract

Cited by 9 (1 self)
We study the use of well-defined building blocks for SPMD programming of machines with distributed memory. Our general framework is based on homomorphisms, functions that capture the idea of data-parallelism and have a close correspondence with collective operations of the MPI standard, e.g., scan and reduction. We prove two composition rules: under certain conditions, a composition of a scan and a reduction can be transformed into one reduction, and a composition of two scans into one scan. As an example of decomposition, we transform a segmented reduction into a composition of partial reduction and allgather. The performance gain and overhead of the proposed composition and decomposition rules are assessed analytically for the hypercube and compared with the estimates for some other parallel models.
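The first composition rule (a scan followed by a reduction collapsing into one reduction) can be illustrated sequentially: instead of materializing the scan result and then reducing it, a single pass carries the pair (running prefix, running total). This sketch shows only the sequential fusion; the paper's contribution concerns the saved communication in the parallel setting, and the operator conditions it requires are not checked here.

```python
def scan_then_reduce(otimes, oplus, xs):
    """Unfused reference: oplus-reduce the inclusive otimes-scan of xs."""
    prefix, prefixes = xs[0], [xs[0]]
    for x in xs[1:]:
        prefix = otimes(prefix, x)
        prefixes.append(prefix)
    total = prefixes[0]
    for p in prefixes[1:]:
        total = oplus(total, p)
    return total

def scan_then_reduce_fused(otimes, oplus, xs):
    """Fused: one pass carrying (running prefix, running total)."""
    prefix = total = xs[0]
    for x in xs[1:]:
        prefix = otimes(prefix, x)    # advance the scan
        total = oplus(total, prefix)  # fold the new prefix immediately
    return total
```

With addition for both operators and input `[1, 2, 3, 4]`, both versions compute 1 + 3 + 6 + 10 = 20, but the fused version never builds the intermediate scan result.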
MATHEMATICAL ENGINEERING TECHNICAL REPORTS: Generator-based GG Fortress Library — Collection of GGs and Theories
, 2008
Abstract
scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.
(last modified: 14.09.2000) Making C++ Ready for Algorithmic Skeletons
, 2000
Abstract
Abstract. Many authors have proposed the use of algorithmic skeletons as a high-level, machine-independent means of developing parallel applications. Until now, their implementation and use was restricted to either functional or sophisticated imperative languages. In this paper we discuss how far C++ supports the integration of algorithmic skeletons and identify currying as the only missing feature. We show how this gap can be closed by integrating currying into C++ through code that is compliant with the ANSI/ISO standard, i.e., by using the language itself instead of extending it. We prove that our method does not yield any runtime penalties if a highly optimizing C++ compiler is used and is therefore competitive with existing sophisticated languages.
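The feature the paper supplies to C++ (currying) is easiest to see in a language with first-class closures. The following Python sketch shows what a curried skeleton parameter buys; the paper itself builds the equivalent in ANSI/ISO C++ with function objects, and all names here are illustrative, not the paper's API.

```python
def curry2(f):
    """Turn a two-argument function into a chain of one-argument functions."""
    return lambda a: lambda b: f(a, b)

def map_skeleton(f, xs):
    """A minimal 'map' skeleton whose function parameter is typically
    obtained by partial application of a curried function."""
    return [f(x) for x in xs]

add = lambda a, b: a + b
inc = curry2(add)(1)                    # partial application: what currying enables
result = map_skeleton(inc, [1, 2, 3])   # [2, 3, 4]
```

The point of the papers is precisely that this partial-application step, trivial in functional languages, needs library support in (pre-C++11) C++.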
How to Realize Data-Parallel Algorithmic Skeletons with C++
Abstract
Many authors have proposed the use of algorithmic skeletons as a high-level, machine-independent means of developing parallel applications. Until now, their implementation and use was restricted to either functional or sophisticated imperative languages. In this paper we show that C++ provides almost all features that are necessary to implement algorithmic skeletons, and we identify currying as the only one that is not intrinsic to the language. We demonstrate how to add currying to the language by relying on language-immanent features, and how to realize it without a negative influence on runtime performance.
Making C++ Ready for Algorithmic Skeletons
, 2000
Abstract
Many authors have proposed the use of algorithmic skeletons as a high-level, machine-independent means of developing parallel applications. Until now, their implementation and use was restricted to either functional or sophisticated imperative languages. In this paper we discuss how far C++ supports the integration of algorithmic skeletons and identify currying as the only missing feature. We show how this gap can be closed by integrating currying into C++ through code that is compliant with the ANSI/ISO standard, i.e., by using the language itself instead of extending it. We prove that our method does not yield any runtime penalties if a highly optimizing C++ compiler is used and is therefore competitive with existing sophisticated languages. 1 Introduction. Algorithmic skeletons represent an approach to parallel programming. The basic idea is to replace explicit parallel programming (e.g., using a parallel language or a message-passing library) b...
On the Parallel Implementation of a Generalized Broadcast
Abstract
We prove the correctness of optimized parallel implementations of a generalized broadcast, in which a value b is distributed to a sequence of processors, indexed from 0 upwards, such that processor i receives g^i b (i.e., some function g applied i times to b). Its straightforward implementation is of linear time complexity in the number of processors. This type of broadcast occurs when combining scans with an ordinary broadcast. The optimized parallel implementation we describe is based on an odd-even tree and has logarithmic time complexity.
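A small simulation makes the complexity gap concrete. The straightforward version applies g once per processor in sequence (linear depth); a doubling version delivers twice as many values per round by squaring the power of g, giving logarithmic depth. This mimics only the tree-shaped doubling idea; the paper's odd-even tree scheme itself is not reproduced here, and the function names are my own.

```python
def gen_broadcast_linear(g, b, p):
    """Processor i receives g^i(b); values are produced one after another,
    so the depth is linear in p."""
    vals, v = [], b
    for _ in range(p):
        vals.append(v)
        v = g(v)
    return vals

def gen_broadcast_doubling(g, b, p):
    """Same result in ceil(log2 p) rounds: each round doubles the set of
    delivered values by applying g^(2^k) to every value already present."""
    vals = [b]
    gk = g                                   # invariant: gk == g^(len(vals))
    while len(vals) < p:
        vals += [gk(v) for v in vals]
        gk = (lambda f: lambda v: f(f(v)))(gk)   # square the power of g
    return vals[:p]
```

In the parallel reading, each round of the doubling loop is one communication step in which every holder of a value forwards a transformed copy, hence the logarithmic time claimed in the abstract.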
From a Tabular Classification to Parallel Implementations of Linearly Recursive Functions
, 1997
Abstract
We propose a classification for a set of linearly recursive functions, which can be expressed as instances of a skeleton for parallel linear recursion, and present new parallel implementations for them. This set includes well-known higher-order functions, like Broadcast, Reduction and Scan, which we call basic components. Many compositions of these basic components are also linearly recursive functions; we present transformation rules from compositions of up to three basic components to instances of our skeleton. The advantage of this approach is that these instances have better parallel implementations than the compositions of the individual implementations of the corresponding basic components. Keywords: functional programming, linear recursion, parallelization, skeletons. 1 Introduction. Functional programming offers a very high-level approach to specifying executable problem solutions. For example, the scheme of linear recursion can be expressed concisely as a higher-order function. ...
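The paper's central device, a skeleton for linear recursion with the basic components as instances, can be sketched sequentially as follows. This is a hedged model only: the skeleton and instance names are my own, and the parallel implementations the paper derives for these instances are not shown.

```python
def linrec(base, combine, xs):
    """Linear recursion on lists:
    linrec e c [x1, .., xn] = c(x1, c(x2, .. c(xn, e)))."""
    if not xs:
        return base
    return combine(xs[0], linrec(base, combine, xs[1:]))

# Reduction as an instance: a right fold with operator op and unit e.
def reduction(op, e, xs):
    return linrec(e, op, xs)

# A (suffix) scan as an instance: each step prepends a new partial result
# to the list of results already computed for the remaining suffix.
def suffix_scan(op, e, xs):
    return linrec([e], lambda x, rest: [op(x, rest[0])] + rest, xs)
```

For addition with unit 0, `reduction` over `[1, 2, 3]` gives 6, and `suffix_scan` gives `[6, 5, 3, 0]`: every suffix combination plus the trailing unit, illustrating how Reduction and Scan arise as instances of one linear-recursion scheme.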
© Springer-Verlag 1998. On Linear List Recursion in Parallel
, 1997
Abstract
Abstract. We propose a classification for a set of linearly recursive functions, which can be expressed as instances of a skeleton for parallel linear recursion, and present new parallel implementations for them. This set includes well-known higher-order functions, like Broadcast, Reduction and Scan, which we call basic components. Many compositions of these basic components are also linearly recursive functions; we present transformation rules from compositions of up to three basic components to instances of our skeleton. The advantage of this approach is that these instances have better parallel implementations than the compositions of the individual implementations of the corresponding basic components.