Results 1 - 10
of
133
Sequoia: Programming the Memory Hierarchy
, 2006
"... We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and p ..."
Abstract
-
Cited by 63 (4 self)
- Add to MetaCart
We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.
Parallel Programmability and the Chapel Language
- Int. J. High Perform. Comput. Appl
"... It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effecti ..."
Abstract
-
Cited by 55 (5 self)
- Add to MetaCart
It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effectively program traditional sequential computers, and this gap seems only to be widening as time passes. The parallel computing community’s inability to tap the skills
A Performance Analysis of the Berkeley UPC Compiler
- ICS'03
, 2003
"... Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing be ..."
Abstract
-
Cited by 36 (16 self)
- Add to MetaCart
Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe a portable open source compiler for UPC. Our goal is to achieve a similar performance while enabling easy porting of the compiler and runtime, and also provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of microbenchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits by hand-optimizing the generated code.
The Cascade High Productivity Language
- in Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS’04
, 2004
"... The strong focus of recent High End Computing efforts on performance has resulted in a low-level parallel programming paradigm characterized by explicit control over message-passing in the framework of a fragmented programming model. In such a model, object code performance is achieved at the expens ..."
Abstract
-
Cited by 35 (3 self)
- Add to MetaCart
The strong focus of recent High End Computing efforts on performance has resulted in a low-level parallel programming paradigm characterized by explicit control over message-passing in the framework of a fragmented programming model. In such a model, object code performance is achieved at the expense of productivity, conciseness, and clarity. This paper describes the design of Chapel, the Cascade High Productivity Language, which is being developed in the DARPA-funded HPCS project Cascade led by Cray Inc. Chapel pushes the state-of-the-art in languages for HEC system programming by focusing on productivity, in particular by combining the goal of highest possible object code performance with that of programmability offered by a high-level user interface. The design of Chapel is guided by four key areas of language technology: multithreading, locality-awareness, object-orientation, and generic programming. The Cascade architecture, which is being developed in parallel with the language, provides key architectural support for its efficient implementation. 1.
Optimizing bandwidth limited problems using one-sided communication and overlap
- In 20th International Parallel and Distributed Processing Symposium (IPDPS
, 2006
"... This paper demonstrates the one-sided communication used in languages like UPC can provide a significant performance advantage for bandwidth-limited applications. This is shown through communication microbenchmarks and a case-study of UPC and MPI implementations of the NAS FT benchmark. Our optimiza ..."
Abstract
-
Cited by 32 (12 self)
- Add to MetaCart
This paper demonstrates the one-sided communication used in languages like UPC can provide a significant performance advantage for bandwidth-limited applications. This is shown through communication microbenchmarks and a case-study of UPC and MPI implementations of the NAS FT benchmark. Our optimizations rely on aggressively overlapping communication with computation, alleviating bottlenecks that typically occur when communication is isolated in a single phase. The new algorithms send more and smaller messages, yet the one-sided versions achieve> 1.9 × speedup over the base Fortran/MPI. Our one-sided versions show an average 15 % improvement over the twosided versions, due to the lower software overhead of onesided communication, whose semantics are fundamentally lighter-weight than message passing. Our UPC results use Berkeley UPC with GASNet and demonstrate the scalability of that system, with performance approaching 0.5 TFlop/s on the FT benchmark with 512 processors. 1.
A MultiPlatform Co-Array Fortran Compiler
- In Proceedings of the 13th Intl. Conference of Parallel Architectures and Compilation Techniques, Antibes Juan-les-Pins
, 2004
"... Co-array Fortran (CAF)—a small set of extensions to Fortran 90—is an emerging model for scalable, global address space parallel programming. CAF’s global address space programming model simplifies the development of singleprogram-multiple-data parallel programs by shifting the burden for managing th ..."
Abstract
-
Cited by 26 (7 self)
- Add to MetaCart
Co-array Fortran (CAF)—a small set of extensions to Fortran 90—is an emerging model for scalable, global address space parallel programming. CAF’s global address space programming model simplifies the development of singleprogram-multiple-data parallel programs by shifting the burden for managing the details of communication from developers to compilers. This paper describes cafc—a prototype implementation of an open-source, multiplatform CAF compiler that generates code well-suited for today’s commodity clusters. The cafc compiler translates CAF into Fortran 90 plus calls to one-sided communication primitives. The paper describes key details of cafc’s approach to generating efficient code for multiple platforms. Experiments compare the performance of CAF and MPI versions of several NAS parallel benchmarks on an Alpha cluster with a Quadrics interconnect, an Itanium 2 cluster with a Myrinet 2000 interconnect and an Itanium 2 cluster with a Quadrics interconnect. These experiments show that cafc compiles CAF programs into code that delivers performance roughly equal to that of hand-optimized MPI programs. 1.
A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures
- IN SUPERCOMPUTING ’00: PROCEEDINGS OF THE 2000 ACM/IEEE CONFERENCE ON SUPERCOMPUTING
, 2000
"... Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applications by comparing implementations of the NAS MG benchmark in several parallel prog ..."
Abstract
-
Cited by 25 (5 self)
- Add to MetaCart
Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applications by comparing implementations of the NAS MG benchmark in several parallel programming languages: Co-Array Fortran, High Performance Fortran, Single Assignment C, and ZPL. We evaluate each language in terms of its portability, its performance, and its ability to express the algorithm clearly and concisely. Experimental platforms include the Cray T3E, IBM SP, SGI Origin, Sun Enterprise 5500, and a high-performance Linux cluster. Our findings indicate that while it is possible to achieve good portability, performance, and expressiveness, most languages currently fall short in at least one of these areas. We find a strong correlation between expressiveness and a language's support for a global view of computation, and we identify key factors for achieving portable performance in multigrid applications.
A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry
- In Proc. of Supercomputing 2002
, 2002
"... This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of ..."
Abstract
-
Cited by 24 (13 self)
- Add to MetaCart
This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system, that transforms a high-level specification of the computation into high-performance parallel code, tailored to the characteristics of the target architecture. An example from computational chemistry is used to illustrate how different code structures are generated under different assumptions of available memory on the target computer.
Communication Optimizations for Fine-grained UPC Applications
- In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques
, 2005
"... Global address space languages like UPC exhibit high performance and portability on a broad class of shared and distributed memory parallel architectures. The most scalable applications use bulk memory copies rather than individual reads and writes to the shared space, but finergrained sharing can b ..."
Abstract
-
Cited by 21 (5 self)
- Add to MetaCart
Global address space languages like UPC exhibit high performance and portability on a broad class of shared and distributed memory parallel architectures. The most scalable applications use bulk memory copies rather than individual reads and writes to the shared space, but finergrained sharing can be useful for scenarios such as dynamic load balancing, event signaling, and distributed hash tables. In this paper we present three optimization techniques for global address space programs with fine-grained communication: redundancy elimination, use of split-phase communication, and communication coalescing. Parallel UPC programs are analyzed using static single assignment form and a dataflow graph, which are extended to handle the various shared and private pointer types that are available in UPC. The optimizations also take advantage of UPC’s relaxed memory consistency model, which reduces the need for cross thread analysis. We demonstrate the effectiveness of the analysis and optimizations using several benchmarks, which were chosen to reflect the kinds of finegrained, communication-intensive phases that exist in some larger applications. The optimizations show speedups of up to 70 % on three parallel systems, which represent three different types of cluster network technologies. 1
Titanium performance and potential: an NPB experimental study
- In proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing (LCPC
, 2005
"... Titanium is an explicitly parallel dialect of Java TM designed for high-performance scientific programming. It offers objectorientation, strong typing, and safe memory management in the context of a language that supports high performance and scalable parallelism. We present an overview of the langu ..."
Abstract
-
Cited by 20 (11 self)
- Add to MetaCart
Titanium is an explicitly parallel dialect of Java TM designed for high-performance scientific programming. It offers objectorientation, strong typing, and safe memory management in the context of a language that supports high performance and scalable parallelism. We present an overview of the language features and demonstrate their use in the context of the NAS Parallel Benchmarks, a standard benchmark suite of kernels that are common across many scientific applications. We argue that parallel languages like Titanium provide greater expressive power than conventional approaches, enabling much more concise and expressive code and minimizing time to solution without sacrificing parallel performance. Empirical results demonstrate our Titanium implementations of three of the NAS Parallel Benchmarks can match or even exceed the performance of the standard MPI/Fortran implementations at realistic problem sizes and processor scales, while still using far cleaner, shorter and more maintainable code. 1

