A Basic-Cycle Calculation Technique for Efficient Dynamic Data Redistribution
- In IEEE Trans. on Parallel and Distributed Systems
, 1998
Cited by 10 (3 self)
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a basic-cycle calculation technique to efficiently perform BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution. The main idea of the basic-cycle calculation technique is, first, to develop closed forms for computing source/destination processors of some specific array elements in a basic-cycle, which is defined as lcm(s, t)/gcd(s, t). These closed forms are then used to efficiently determine the communication sets of a basic-cycle. From the source/destination processor/data sets of a basic-cycle, we can efficiently perform a BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution. To evaluate the performance of the ...
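The quantities named in this abstract can be illustrated with a minimal sketch. It assumes the standard block-cyclic ownership rule (element i belongs to processor (i div block) mod P) and builds the communication sets by brute force; it does not reproduce the paper's closed forms. The function names `owner`, `basic_cycle`, and `communication_sets` are hypothetical.

```python
from collections import defaultdict
from math import gcd


def owner(i, block, nprocs):
    """Processor owning global index i under BLOCK-CYCLIC(block) on nprocs processors."""
    return (i // block) % nprocs


def basic_cycle(s, t):
    """Basic-cycle as defined in the abstract: lcm(s, t) / gcd(s, t)."""
    g = gcd(s, t)
    lcm = s * t // g
    return lcm // g


def communication_sets(n, s, t, nprocs):
    """Brute force: map (source, destination) -> global indices that move between them.

    Assumption: this enumerates every element, which the closed forms in the
    paper are designed to avoid.
    """
    sets = defaultdict(list)
    for i in range(n):
        sets[(owner(i, s, nprocs), owner(i, t, nprocs))].append(i)
    return dict(sets)
```

For example, `communication_sets(12, 2, 3, 2)` groups the 12 elements of a BLOCK-CYCLIC(2) array on two processors by where they must go under BLOCK-CYCLIC(3).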
Distribution assignment placement: Effective optimization of redistribution costs
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 2002
Cited by 8 (2 self)
Data locality and workload balance are key factors for getting high performance out of data-parallel programs on multiprocessor architectures. Data-parallel languages such as High-Performance Fortran (HPF) thus offer means allowing a programmer both to specify data distributions, as well as to change them dynamically in order to maintain these properties. On the other hand, redistributions can be quite expensive and significantly degrade a program's performance. They must thus be reduced to a minimum. In this article, we present a novel, aggressive approach for avoiding unnecessary remappings which works by eliminating partially dead and partially redundant distribution changes. Basically, this approach evolves from extending and combining two algorithms for these optimizations achieving each on its own optimal results. In distinction to the sequential setting, the data-parallel setting leads naturally to a family of algorithms of varying power and efficiency allowing requirement-customized solutions. The power and flexibility of the new approach are demonstrated by various examples, which range from typical HPF fragments to real world programs. Performance measurements underline its importance and show its effectivity on different hardware platforms and different settings.
A generalized processor mapping technique for array redistribution
- IEEE Trans. Parallel Distributed Systems
, 2001
Cited by 6 (0 self)
In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access in many parallel programs on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a generalized processor mapping technique to minimize the amount of data exchange for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) array redistribution and vice versa. The main idea of the generalized processor mapping technique is first to develop mapping functions for computing a new rank of each destination processor. Based on the mapping functions, a new logical sequence of destination processors can be derived. The new logical processor sequence is then used to minimize the amount of data exchange in a redistribution. The generalized processor mapping technique can handle array redistribution with arbitrary source and destination processor sets and can be applied to multidimensional array redistribution. We present a theoretical model to analyze the performance improvement of the generalized processor mapping technique. To evaluate the performance of the proposed technique, we have implemented the generalized processor mapping technique on an IBM SP2 parallel machine. The experimental results show that the generalized processor mapping technique can provide performance improvement over a wide range of redistribution problems. Index Terms: Array redistribution, generalized processor mapping, distributed memory multicomputers, runtime support.
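The effect of reordering destination processors can be illustrated with a small sketch. It counts how many elements change processor when logical destination rank q is placed on physical processor perm[q], and finds the best relabeling by exhaustive search. The exhaustive search is an assumption made for illustration and stands in for the paper's mapping functions, which the abstract does not spell out; all names are hypothetical.

```python
from itertools import permutations


def owner(i, block, nprocs):
    """Processor owning global index i under BLOCK-CYCLIC(block)."""
    return (i // block) % nprocs


def moved_elements(n, src_block, dst_block, nprocs, perm):
    """Count elements whose source processor differs from perm[destination rank]."""
    return sum(
        1
        for i in range(n)
        if owner(i, src_block, nprocs) != perm[owner(i, dst_block, nprocs)]
    )


def best_relabeling(n, src_block, dst_block, nprocs):
    """Exhaustive search over destination relabelings (feasible for small nprocs only)."""
    return min(
        permutations(range(nprocs)),
        key=lambda p: moved_elements(n, src_block, dst_block, nprocs, p),
    )
```

With n = 12, BLOCK-CYCLIC(2) to BLOCK-CYCLIC(1) on three processors, the identity ordering moves 8 elements, while the best relabeling moves 6.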
Distribution Assignment Placement
- IN PROC. OF EURO-PAR '97, LNCS 1300
, 1997
Cited by 5 (5 self)
The change of distributions of arrays at runtime in languages such as High Performance Fortran (HPF) addresses the demands posed by advanced applications with dynamically varying processor workloads or varying computational kernels. We introduce distribution assignments which are generated by the compiler at subprogram boundaries and for redistribute/realign directives in order to enforce distribution changes. Distribution assignments serve as intermediate representation which allows an abstraction from high-level language constructs which simplifies the analysis task and permits a generic description of the optimization phase. In this paper an aggressive optimization based on bidirectional data flow frameworks is presented which reduces the number of executed distribution assignments more effectively than usual approaches. Instead of implementing individual optimizations, a more general problem, namely elimination of partially dead and partially redundant distribution assign...
Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF
, 1997
Cited by 2 (1 self)
This paper presents compilation techniques to compress holes, which are memory locations mapped by useless template cells and are caused by the non-unit alignment stride in a two-level data-processor mapping. A two-level data-processor mapping provides user to specify data-processor mapping by aligning related array objects with a template, and then distributing the template onto the user-declared abstract processors. In a two-level data-processor mapping, there is a repeated pattern for array elements mapped onto processors. We classify blocks into classes and use a class table to record the attributes of classes for the data distribution. Similarly, data distribution on a processor also has a repeated pattern. We use compression table to record the attributes of the first data distribution pattern on that processor. By using class table and compression table, hole compression can be easily and efficiently achieved. Compressing holes can save memory usage, improve spatial locality a...
Design And Optimization Of Coordination Mechanisms For Data-Parallel Tasks
, 1996
Cited by 2 (2 self)
Data-parallel programming languages can reduce the difficulty of developing efficient applications for contemporary parallel computers. However, many applications can benefit from a mixture of task and data parallelism. We present a library-based approach that permits programmers to coordinate data-parallel tasks using explicit message-passing operations. We discuss in detail the design of a prototype library that supports inter-task transfers of arrays in an efficient manner on distributed-memory multicomputers. Measurements with a synthetic benchmark show that in many cases the library can realize a significant fraction of a multicomputer's peak communication performance, and reveal the sources of overheads that reduce the library's performance in other cases. We also develop an analytic model of array transfer performance as a means of predicting inter-task communication costs.
Distribution Assignment Placement: A New Aggressive Approach for Optimizing Redistribution Costs
, 1997
Cited by 2 (2 self)
Dynamic data redistribution is a key technique for maintaining data locality and workload balance in data-parallel languages like High Performance Fortran (HPF). On the other hand, redistributions can be very expensive and significantly degrade a program's performance. In this article, we present a novel and aggressive approach for avoiding unnecessary remappings by eliminating partially dead and partially redundant distribution changes. Basically, this approach evolves from extending and combining two algorithms for these optimizations achieving optimal results for sequential programs. Optimality, however, becomes more intricate by the combination. Unlike the sequential setting the data-parallel setting leads to a hierarchy of algorithms of varying power and efficiency fitting a user's individual needs. The power and flexibility of the new approach are demonstrated by illustrating examples. First practical experiences underline its importance and effectivity. Keywords: Data-parallel...
Efficient index generation for compiling two-level data-processor mappings in data-parallel programs
- Journal of Parallel and Distributed Computing
Cited by 1 (1 self)
This paper presents compilation techniques used to compress holes, which are caused by the nonunit alignment stride in a two-level data-processor mapping. Holes are the memory locations mapped by useless template cells. To fully utilize the memory space, memory holes should be removed. In a two-level data-processor mapping, there is a repetitive pattern for array elements mapped onto processors. We classify blocks into classes and use a class table to record the distribution of each class in the first repetitive data distribution pattern. Similarly, data distribution on a processor also has a repetitive pattern. We use a compression table to record the distribution of each block in the first repetitive data distribution pattern on a processor. By using a class table and a compression table, hole compression can be easily and efficiently achieved. Compressing holes can save memory usage, improve spatial locality and further improve system performance. The proposed method is efficient, stable, and easy to implement. The experimental results do confirm the advantages of our proposed method over existing methods. © 2000 Academic Press. Key Words: communication set; distributed-memory multicomputers; High Performance Fortran; hole compression; two-level data-processor mapping.
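What a hole is, and what compressing it means, can be shown with a brute-force sketch. It assumes array element A(i) is aligned with template cell stride*i + offset and that the template is distributed BLOCK-CYCLIC(block) over nprocs processors. It enumerates every template cell rather than using the class table and compression table described above; the function and parameter names are hypothetical.

```python
def local_layout(n, stride, offset, block, nprocs, p):
    """Return (uncompressed, compressed) local addresses of array elements on processor p.

    uncompressed[i]: position of A(i) when every owned template cell gets a slot,
                     so cells with no array element become holes.
    compressed[i]:   position of A(i) when the holes are squeezed out.
    """
    template_size = stride * (n - 1) + offset + 1
    # Template cells owned by processor p, in increasing order.
    owned = [c for c in range(template_size) if (c // block) % nprocs == p]
    # Template cells that actually hold an array element.
    occupied = {stride * i + offset: i for i in range(n)}

    uncompressed, compressed = {}, {}
    for local_addr, cell in enumerate(owned):
        if cell in occupied:
            i = occupied[cell]
            uncompressed[i] = local_addr
            compressed[i] = len(compressed)  # next free hole-free slot
    return uncompressed, compressed
```

For a 6-element array with alignment stride 2, block size 3, and two processors, processor 0 owns six template cells but only four array elements, so compression shrinks its local storage from 6 slots to 4.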
Optimization of Data Remapping in Data-Parallel Languages
, 1998
Cited by 1 (1 self)
The user-controlled mapping of data across the local memories of processing nodes is one of the central features of data-parallel languages like High Performance Fortran (HPF). Since many scientific applications typically consist of different computational phases owning each a best data mapping, dynamic remappings have proven useful in maintaining good data locality and workload balance. HPF supports remappings by procedure calls and by executing redistribute/realign directives. But remappings can be quite expensive as communication is required to migrate the array elements to their new owning processors and can significantly degrade a program's performance. Hence, elimination of unnecessary remappings is of key importance. It is essential because even well-written HPF programs may result in unnecessary remappings. In this thesis we optimize the overall time spent for dynamic data remappings in a program run by reducing the number of executed remappings. Elimination of redundant and d...
Efficient Methods for kr → r and r → kr Array
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient algorithms for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. The most significant improvement of our methods is that a processor does not need to construct the send/receive data sets for a redistribution. Based on the packing/unpacking information derived from the BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution and vice versa, a processor can pack/unpack array elements into (from) messages directly. To evaluate the performance of our methods, we have implemented our methods along with Thakur’s methods and the PITFALLS method on an IBM SP2 parallel machine. The experimental results show that our algorithms outperform Thakur’s methods and the PITFALLS method for all test samples. This result encourages us to use the proposed algorithms for array redistribution.
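Direct packing without explicit send sets can be sketched as follows. The destination rule is derived here from the block-cyclic definition alone and is not taken from the paper: the m-th local block of size kr on processor p is global block g = m*P + p, and its j-th sub-block of size r is global r-block g*k + j, which lands on processor (g*k + j) mod P. The function name `pack_kr_to_r` is hypothetical, and the sketch assumes the local array length is a multiple of k*r.

```python
def pack_kr_to_r(local, p, k, r, nprocs):
    """Pack processor p's local BLOCK-CYCLIC(k*r) data into one message per destination.

    Assumption: len(local) is a multiple of k*r.
    """
    messages = [[] for _ in range(nprocs)]
    kr = k * r
    for start in range(0, len(local), kr):
        m = start // kr              # index of this block within the local array
        g = m * nprocs + p           # global index of the same kr-sized block
        for j in range(k):           # split the block into k sub-blocks of size r
            dest = (g * k + j) % nprocs
            messages[dest].extend(local[start + j * r : start + (j + 1) * r])
    return messages
```

With k = 2, r = 1, and three processors, processor 0 holds global elements 0, 1, 6, 7 and packs them as `[[0, 6], [1, 7], []]`: one pass over the local array, with no per-element ownership test.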

