Performance Analysis of pC++: A Portable Data-Parallel Programming System for Scalable Parallel Computers
- Proc. 8th Int. Parallel Processing Symp. (IPPS), Cancún, Mexico, IEEE Computer
, 1994
Cited by 19 (4 self)
pC++ is a language extension to C++ designed to allow programmers to compose distributed data structures with parallel execution semantics. These data structures are organized as "concurrent aggregate" collection classes which can be aligned and distributed over the memory hierarchy of a parallel machine in a manner consistent with the High Performance Fortran Forum (HPF) directives for Fortran 90. pC++ allows the user to write portable and efficient code which will run on a wide range of scalable parallel computers. In this paper, we discuss the performance analysis of the pC++ programming system. We describe the performance tools developed and include scalability measurements for four benchmark programs: a "nearest neighbor" grid computation, a fast Poisson solver, and the "Embar" and "Sparse" codes from the NAS suite. In addition to speedup numbers, we present a detailed analysis highlighting performance issues at the language, runtime system, and target system levels. 1 Introducti...
Installation Guide to mpich, a Portable Implementation of MPI
, 1996
Cited by 18 (4 self)
1 Quick Start
2 Obtaining and Unpacking the Distribution
3 Documentation
4 Configuring mpich
  4.1 Building a production mpich
  4.2 Preparing mpich for TotalView debugging
  4.3 What if there is no Fortran compiler?
  4.4 Configuring with the Absoft Fortran Compiler
  4.5 Fortran 90
  4.6 Special issues for heterogeneous networks
  4.7 Configuring with ssh
5 Compiling mpich
  5.1 Getting tcl, tk, and wish
  5.2 Building multiple devices or architectures
6 Running an MPI Program
7 MPE Library
  7.1 Configure Options ...
Program Analysis Environments for Parallel Language Systems: The tau Environment
- In Proceedings of the 2nd Workshop on Environments and Tools for Parallel Scientific Computing
, 1994
Cited by 17 (6 self)
In this paper, we discuss τ (TAU, Tuning and Analysis Utilities), the first prototype of an integrated and portable program analysis environment for pC++, a parallel object-oriented language system. τ is unique in that it was developed specifically for pC++ and relies heavily on pC++'s compiler and transformation tools (specifically, the Sage++ toolkit) for its implementation. This tight integration allows τ to achieve a combination of portability, functionality, and usability not commonly found in high-level language environments. The paper describes the design and functionality of τ, using a new tool for breakpoint-based program analysis as an example of τ's capabilities. 1 Introduction The trend towards using high-level parallel language systems to program scalable parallel computers must be accompanied by advances in the tools and environments for program analysis and tuning. The language system concerns are achieving programmability through parallel programming abstractions...
I/O Characterization of a Portable Astrophysics Application on the IBM SP and Intel Paragon
, 1995
Cited by 14 (1 self)
Many large-scale applications on parallel machines are bottlenecked by the I/O performance rather than the CPU or communication performance of the system. To improve the I/O performance, it is first necessary for system designers to understand the I/O requirements of various applications. This paper presents the results of a study of the I/O characteristics and performance of a real, I/O-intensive, portable, parallel application in astrophysics, on two different parallel machines: the IBM SP and the Intel Paragon. We instrumented the source code to record all I/O activity, and analyzed the resulting trace files. Our results show that, for this application, the I/O consists of fairly large writes, and writing data to files is faster on the Paragon, whereas opening and closing files are faster on the SP. We also discuss how the I/O performance of this application could be improved; in particular, we believe that this application would benefit from using collective I/O.
User's Guide for MPE: Extensions for MPI Programs
- Argonne National Laboratory
, 1998
Performance Analysis of MPI Programs
Cited by 12 (7 self)
The Message Passing Interface (MPI) standard has recently been completed. MPI is a specification for a library of functions that implement the message-passing model of parallel computation. One novel feature of MPI is its very general "profiling interface," which allows users to attach assorted profiling tools to the MPI library even though they do not have access to the MPI source code. We describe the MPI profiling interface and three profiling libraries that make use of it. These libraries are distributed with the portable, publicly available implementation of MPI.
Time-Parallel Computation of Pseudo-Adjoints for a Leapfrog Scheme
- Preprint ANL/MCS-P639-0197, Mathematics and Computer Science Division, Argonne National Laboratory
, 1997
Cited by 10 (5 self)
The leapfrog scheme is a commonly used second-order difference scheme for solving differential equations. If $Z(t)$ denotes the state of the system at time $t$, the leapfrog scheme computes the state at the next time step as $Z(t+1) = H(Z(t), Z(t-1), W)$, where $H$ is the nonlinear timestepping operator and $W$ are parameters that are not time dependent. In this article, we show how the associativity of the chain rule of differential calculus can be used to compute a so-called adjoint $x^T \cdot (dZ(t)/d[Z(0), W])$ efficiently in a parallel fashion. To this end, we (1) employ the reverse mode of automatic differentiation at the outermost level, (2) use a sparsity-exploiting incarnation of the forward mode of automatic differentiation to compute derivatives of $H$ at every time step, and (3) exploit chain rule associativity to compute derivatives at individual time steps in parallel. We report on experimental results with a 2-D shallow-water equation model problem on an IBM SP parallel...
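The chain-rule associativity mentioned in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the scalar operator H, the step size DT, and all function names are assumptions chosen for illustration, the per-step Jacobians are written by hand rather than produced by automatic differentiation, and only the derivative with respect to the initial states Z(0) and Z(-1) is computed (the parameters W are held fixed).

```python
DT = 0.1  # illustrative step size (assumption, not from the paper)

def H(z, z_prev, w):
    """Hypothetical nonlinear leapfrog step for dz/dt = -w * z**2."""
    return z_prev - 2.0 * DT * w * z * z

def step_jacobian(z, w):
    """Jacobian of the map (Z(t), Z(t-1)) -> (Z(t+1), Z(t)), evaluated at Z(t) = z."""
    return [[-4.0 * DT * w * z, 1.0],
            [1.0, 0.0]]

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def vecmat(x, a):
    """Row vector x times 2x2 matrix a."""
    return [sum(x[i] * a[i][j] for i in range(2)) for j in range(2)]

def trajectory(z0, z_minus1, w, steps):
    """Return [Z(-1), Z(0), Z(1), ..., Z(steps)]."""
    zs = [z_minus1, z0]
    for _ in range(steps):
        zs.append(H(zs[-1], zs[-2], w))
    return zs

def jacobians(z0, z_minus1, w, steps):
    """Per-step Jacobians J_0 .. J_{steps-1} along the trajectory."""
    zs = trajectory(z0, z_minus1, w, steps)
    return [step_jacobian(zs[t + 1], w) for t in range(steps)]

def adjoint_sequential(x, jacs):
    """Reverse sweep: x^T J_{T-1} ... J_0 using only vector-matrix products."""
    v = list(x)
    for jac in reversed(jacs):
        v = vecmat(v, jac)
    return v

def tree_product(jacs):
    """J_{T-1} ... J_0 grouped as a balanced tree.

    The two halves are independent of each other, so they could be
    evaluated on different processors; associativity guarantees the
    same result as the sequential product up to rounding.
    """
    if len(jacs) == 1:
        return jacs[0]
    mid = len(jacs) // 2
    early = tree_product(jacs[:mid])
    late = tree_product(jacs[mid:])
    return matmul(late, early)
```

Calling `adjoint_sequential(x, jacs)` and `vecmat(x, tree_product(jacs))` on the same list of Jacobians gives the same adjoint vector up to floating-point rounding; the tree form only changes the order in which the products are grouped.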
Analyzing Message Passing Programs on the CRAY T3E with PAT and VAMPIR
, 1998
Cited by 9 (5 self)
Writing efficient parallel programs for a massively parallel system like the CRAY T3E is still a difficult task because such programs are typically very large and complex, not trivially parallelizable, and their dynamic behavior is difficult to understand or predict. Therefore, runtime performance analysis tools are needed on such systems in addition to the normal programming environment tools like editors or debuggers. For the CRAY T3E, Silicon Graphics/Cray Research implemented and provides two performance analysis tools, Apprentice and PAT. Apprentice is a profiling tool that uses source code instrumentation through compiler switches and provides statistics on the level of functions and basic blocks. PAT, the Performance Analysis Tool, is actually several tools in one. It provides profiling through sampling and access to performance information. It also includes an object code instrumenter which may be used for detailed call site profiling and gathering of function-level performance statisti...
Orca: a Portable User-Level Shared Object System
, 1996
Cited by 8 (1 self)
Orca is an object-based distributed shared memory system that is designed for writing portable and efficient parallel programs. Orca hides the communication substrate from the programmer by providing an abstract communication model based on shared objects. Mutual exclusion and condition synchronization are cleanly integrated in the model. Orca has been implemented using a layered system, consisting of a compiler, a runtime system, and a virtual machine (Panda). To implement shared objects efficiently on a distributed-memory machine, the Orca compiler generates regular expressions describing how shared objects are accessed. The runtime system uses this information together with runtime statistics to decide which objects to replicate and where to store nonreplicated objects. The Orca system has been implemented on a range of platforms (including Solaris, Amoeba, Parix, and the CM-5). Measurements of several benchmarks and applications across four platforms show that the new Orca system a...
Speedy: An Integrated Performance Extrapolation Tool for pC++ Programs
- In Quantitative Evaluation of Computing and Communication Systems: Proceedings of the 8th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, volume 977 of Lecture Notes in Computer Science
, 1995
Cited by 8 (1 self)
Performance extrapolation is the process of evaluating the performance of a parallel program in a target execution environment using performance information obtained for the same program in a different environment. Performance extrapolation techniques are suited for rapid performance tuning of parallel programs, particularly when the target environment is unavailable. This paper describes one such technique that was developed for data-parallel C++ programs written in the pC++ language. In pC++, the programmer can distribute a collection of objects to various processors and can have methods invoked on those objects execute in parallel. Using performance extrapolation in the development of pC++ applications allows tuning decisions to be made in advance of detailed execution measurements. The pC++ language system includes τ, an integrated environment for analyzing and tuning the performance of pC++ programs. This paper presents speedy, a new addition to τ, that predicts the performa...

