Results 1 - 10 of 48
Distributed memory compiler design for sparse problems
- IEEE Transactions on Computers, 1995
Cited by 66 (10 self)
This paper addresses the issue of compiling concurrent loop nests in the presence of complicated array references and irregularly distributed arrays. Arrays accessed within loops may contain accesses that make it impossible to precisely determine the reference pattern at compile time. This paper proposes a run time support mechanism that is used effectively by a compiler to generate efficient code in these situations. The compiler accepts as input a Fortran 77 program enhanced with specifications for distributing data, and outputs a message passing program that runs on the nodes of a distributed memory machine. The runtime support for the compiler consists of a library of primitives designed to support irregular patterns of distributed array accesses and irregularly distributed array partitions. A variety of performance results on the Intel iPSC/860 are presented.
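The runtime idea in this abstract (resolving an access pattern that is only known at run time) can be illustrated with a small single-process sketch. It assumes a contiguous block distribution and a loop body of the form y[i] = x[ia[i]]; the function names `block_owner`, `inspector`, and `executor` are hypothetical and are not the primitives of the paper's library, which the abstract does not name.

```python
# Hypothetical sketch: the names and the block distribution are assumptions,
# not the API of the runtime library described in the paper.

def block_owner(g, n, p):
    """Owner of global index g when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division
    return g // block

def inspector(indices, me, n, p):
    """Scan the indirection array once and build a gather schedule:
    for each remote owner, the sorted distinct global indices needed from it."""
    schedule = {}
    for g in indices:
        owner = block_owner(g, n, p)
        if owner != me:
            schedule.setdefault(owner, set()).add(g)
    return {owner: sorted(needed) for owner, needed in schedule.items()}

def executor(indices, me, n, p, local_parts, schedule):
    """Fetch off-processor values according to the schedule, then run y[i] = x[ia[i]]."""
    block = -(-n // p)
    ghost = {}
    for owner, needed in schedule.items():
        for g in needed:  # stands in for one aggregated message per owner
            ghost[g] = local_parts[owner][g - owner * block]
    result = []
    for g in indices:
        if block_owner(g, n, p) == me:
            result.append(local_parts[me][g - me * block])
        else:
            result.append(ghost[g])
    return result

n, p = 8, 2
x = [10 * g for g in range(n)]
local_parts = [x[0:4], x[4:8]]   # processor 0 owns x[0..3], processor 1 owns x[4..7]
ia = [0, 5, 3, 5, 7]             # indirection array, known only at run time
schedule = inspector(ia, me=0, n=n, p=p)
y = executor(ia, 0, n, p, local_parts, schedule)
```

Here processor 0 needs indices 5 and 7 from processor 1, and the duplicate reference to index 5 is fetched only once. Separating the scan from the loop body lets the schedule be reused when the same indirection array is traversed repeatedly.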
Parallel Programmability and the Chapel Language
- Int. J. High Perform. Comput. Appl
Cited by 55 (5 self)
It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effectively program traditional sequential computers, and this gap seems only to be widening as time passes. The parallel computing community’s inability to tap the skills
An Integrated Runtime and Compile-time Approach for Parallelizing Structured and Block Structured Applications
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1995
Cited by 54 (12 self)
Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present a combined runtime and compile-time approach for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which can be used to port these applications on distributed memory machines. The library is currently implemented on several different systems. To further ease the task of application programmers, we have developed methods for integrating this runtime library with compilers for HPF-like parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University. We present experimental results to demonstrate the efficacy of our approach. We have exper...
The Cascade High Productivity Language
- in Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS’04), 2004
Cited by 35 (3 self)
The strong focus of recent High End Computing efforts on performance has resulted in a low-level parallel programming paradigm characterized by explicit control over message-passing in the framework of a fragmented programming model. In such a model, object code performance is achieved at the expense of productivity, conciseness, and clarity. This paper describes the design of Chapel, the Cascade High Productivity Language, which is being developed in the DARPA-funded HPCS project Cascade led by Cray Inc. Chapel pushes the state-of-the-art in languages for HEC system programming by focusing on productivity, in particular by combining the goal of highest possible object code performance with that of programmability offered by a high-level user interface. The design of Chapel is guided by four key areas of language technology: multithreading, locality-awareness, object-orientation, and generic programming. The Cascade architecture, which is being developed in parallel with the language, provides key architectural support for its efficient implementation.
Processor Mapping Techniques Toward Efficient Data Redistribution
- IEEE Trans. Parallel Distributed Systems, 1995
Cited by 30 (0 self)
Run-time data redistribution can enhance algorithm performance in distributed-memory machines. Explicit redistribution of data can be performed between algorithm phases when a different data decomposition is expected to deliver increased performance for a subsequent phase of computation. Redistribution, however, represents increased program overhead as algorithm computation is discontinued while data are exchanged among processor memories. In this paper, we present a technique that minimizes the amount of data exchange for BLOCK to CYCLIC(c) (or vice-versa) redistributions of an arbitrary number of dimensions. Preserving the semantics of the target (destination) distribution pattern, the technique manipulates the data-to-logical-processor mapping of the target pattern. When implemented on an IBM SP-x, the mapping technique demonstrates redistribution performance improvements of approximately 40% over traditional data-to-processor mapping. Relative to the traditional mapping technique, the ...
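To make the idea of remapping logical target processors concrete, here is a one-dimensional sketch. It is a brute-force illustration under stated assumptions, not the paper's technique: the owner formulas follow the usual BLOCK and CYCLIC(c) conventions, and the helper names (`moved`, `best_mapping`) are hypothetical. The search over all permutations is only practical for very small processor counts.

```python
from itertools import permutations

def block_owner(i, n, p):
    """Owner of element i under BLOCK distribution of n elements over p processors."""
    block = -(-n // p)  # ceiling division
    return i // block

def cyclic_owner(i, c, p):
    """Logical owner of element i under CYCLIC(c) distribution over p processors."""
    return (i // c) % p

def moved(n, p, c, mapping):
    """Count elements whose physical owner changes in a BLOCK to CYCLIC(c) redistribution.
    mapping[q] is the physical processor that hosts logical target processor q."""
    return sum(block_owner(i, n, p) != mapping[cyclic_owner(i, c, p)]
               for i in range(n))

def best_mapping(n, p, c):
    """Brute-force search (an assumption for illustration) for the logical-to-physical
    mapping that leaves the most data in place."""
    return min(permutations(range(p)), key=lambda m: moved(n, p, c, m))

n, p, c = 16, 4, 2
identity = tuple(range(p))
best = best_mapping(n, p, c)
moved_identity = moved(n, p, c, identity)
moved_best = moved(n, p, c, best)
print(moved_identity, moved_best)
```

For this small case the identity mapping moves 12 of the 16 elements, while the best relabeling of the target processors moves 8. The target distribution pattern is unchanged; only the assignment of logical processors to physical ones differs.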
Block-Cyclic Dense Linear Algebra
- SIAM JOURNAL ON SCIENTIFIC AND STATISTICAL COMPUTING, 1992
Cited by 28 (0 self)
Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic order elimination together with a consecutive data allocation yields good load balance for both the factorization and solution phases for the solution of dense systems of equations by LU and QR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS are ideal for operations local to a node. High rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update. We show that in many parallel implementations, the O(N²) work in the factorization may be of the same significance as the O(N³
Vienna-Fortran/HPF Extensions for Sparse and Irregular Problems and Their Compilation
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1997
Cited by 26 (10 self)
Vienna Fortran, High Performance Fortran (HPF) and other data parallel languages have been introduced to allow the programming of massively parallel distributed-memory machines (DMMP) at a relatively high level of abstraction based on the SPMD paradigm. Their main features include directives to express the distribution of data and computations across the processors of a machine. In this paper, we use Vienna-Fortran as a general framework for dealing with sparse data structures. We describe new methods for the representation and distribution of such data on DMMPs, and propose simple language features that permit the user to characterize a matrix as "sparse" and specify the associated representation. Together with the data distribution for the matrix, this enables the compiler and runtime system to translate sequential sparse code into explicitly parallel message-passing code. We develop new compilation and runtime techniques, which focus on achieving storage economy and reducing communi...
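As an illustration of what a sparse representation combined with a data distribution might look like, the sketch below stores a matrix in compressed row storage (CRS) and gives each processor a contiguous block of rows. Both choices are assumptions made for the example; the abstract does not say which representations or distributions the paper uses, and the function names are hypothetical. The vector x is assumed to be replicated on every processor, so no communication is modeled.

```python
def to_crs(dense):
    """Convert a dense matrix (list of rows) to CRS arrays: values, col_idx, row_ptr."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # marks where the next row starts
    return values, col_idx, row_ptr

def row_block(row_ptr, p, me):
    """Rows owned by processor `me` under a contiguous row-block distribution (assumed)."""
    n = len(row_ptr) - 1
    block = -(-n // p)  # ceiling division
    return range(min(me * block, n), min((me + 1) * block, n))

def local_spmv(values, col_idx, row_ptr, x, rows):
    """Multiply the locally owned rows by x, touching only stored nonzeros."""
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in rows]

A = [[4, 0, 0, 1],
     [0, 3, 0, 0],
     [0, 0, 2, 0],
     [5, 0, 0, 6]]
x = [1, 2, 3, 4]
values, col_idx, row_ptr = to_crs(A)
y0 = local_spmv(values, col_idx, row_ptr, x, row_block(row_ptr, 2, 0))
y1 = local_spmv(values, col_idx, row_ptr, x, row_block(row_ptr, 2, 1))
```

Only the six nonzeros are stored, and each simulated processor computes its own slice of the product, which is the kind of storage economy the abstract refers to.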
A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures
- IN SUPERCOMPUTING ’00: PROCEEDINGS OF THE 2000 ACM/IEEE CONFERENCE ON SUPERCOMPUTING, 2000
Cited by 25 (5 self)
Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applications by comparing implementations of the NAS MG benchmark in several parallel programming languages: Co-Array Fortran, High Performance Fortran, Single Assignment C, and ZPL. We evaluate each language in terms of its portability, its performance, and its ability to express the algorithm clearly and concisely. Experimental platforms include the Cray T3E, IBM SP, SGI Origin, Sun Enterprise 5500, and a high-performance Linux cluster. Our findings indicate that while it is possible to achieve good portability, performance, and expressiveness, most languages currently fall short in at least one of these areas. We find a strong correlation between expressiveness and a language's support for a global view of computation, and we identify key factors for achieving portable performance in multigrid applications.
Algorithmic redistribution methods for block cyclic decompositions
- IEEE Trans. on PDS, 1996
Cited by 22 (2 self)
To my parents. Acknowledgments. The writer expresses gratitude and appreciation to the members of his dissertation committee, Michael Berry, Charles Collins, Jack Dongarra, Mark Jones and David Walker for their encouragement and participation throughout my doctoral experience. Special appreciation is due to Professor Jack Dongarra, Chairman, who provided sound guidance, support and appropriate commentaries during the course of my graduate study. I also would like to thank Yves Robert and R. Clint Whaley for many useful and instructive discussions on general parallel algorithms and message passing software libraries. Many valuable comments for improving the presentation of this document were received from L. Susan Blackford. Finally, I am grateful to the Department of Computer Science at the University of Tennessee for allowing me to do this doctoral research work here. A special debt of gratitude is owed to Joanne Martin, IBM POWERparallel Division, for awarding me an IBM Corporation Fellowship covering the tuition as well as a stipend for the 1994-96 academic years. This work was also supported
Minimizing the Communication Time for Matrix Multiplication on Multiprocessors
- Parallel Computing, 1992
Cited by 20 (9 self)
We present one matrix multiplication algorithm for two-dimensional arrays of processing nodes, and one algorithm for three-dimensional nodal arrays. One-dimensional nodal arrays are treated as a degenerate case. The algorithms are designed to utilize fully the communications bandwidth in high degree networks in which the one-, two-, or three-dimensional arrays may be embedded. For binary n-cubes, our algorithms offer a speedup of the communication over previous algorithms for square matrices and square two-dimensional arrays by a factor of $n/2$. Configuring the $N = 2^n$ processing nodes as a three-dimensional array may reduce the communication complexity by a factor of $N^{1/6}$ compared to a two-dimensional nodal array. The three-dimensional algorithm requires temporary storage proportional to the length of the nodal array axis aligned with the axis shared between the multiplier and the multiplicand. The optimal two-dimensional nodal array shape with respect to communicati...

