Results 1–10 of 58
POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems
, 2001
Transforming Loops to Recursion for Multi-Level Memory Hierarchies
, 2000
Abstract

Cited by 34 (4 self)
Recently, there have been several experimental and theoretical results showing significant performance benefits of recursive algorithms on both multi-level memory hierarchies and on shared-memory systems. In particular, such algorithms have the data reuse characteristics of a blocked algorithm that is simultaneously blocked at many different levels. Most existing applications, however, are written using ordinary loops. We present a new compiler transformation that can be used to convert loop nests into recursive form automatically. We show that the algorithm is fast and effective, handling loop nests with arbitrary nesting and control flow. The transformation achieves substantial performance improvements for several linear algebra codes even on a current system with a two-level cache hierarchy. As a side-effect of this work, we also develop an improved algorithm for transitive dependence analysis (a powerful technique used in the recursion transformation and other loop transformations) that is much faster than the best previously known algorithm in practice.
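To illustrate the idea (a generic sketch, not the paper's transformation algorithm), a triple-loop matrix multiply can be rewritten as a divide-and-conquer recursion that is implicitly blocked at every scale of the memory hierarchy:

```python
def matmul_loops(A, B, C, n):
    # Original loop nest: C += A * B for n x n matrices.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]

def matmul_recursive(A, B, C, i0, j0, k0, n, cutoff=16):
    # Recursive form: each n x n problem splits into eight n/2 x n/2
    # sub-problems (n is assumed to be a power of two here). Because the
    # halving happens at every scale, the working set eventually fits in
    # each cache level without any explicit, machine-specific tile size.
    if n <= cutoff:
        # Base case: the original loop body over a small sub-block.
        for i in range(i0, i0 + n):
            for j in range(j0, j0 + n):
                for k in range(k0, k0 + n):
                    C[i][j] += A[i][k] * B[k][j]
        return
    h = n // 2
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                matmul_recursive(A, B, C, i0 + di, j0 + dj, k0 + dk, h, cutoff)
```

With the cutoff set below n, the recursion reproduces the loop nest's result exactly while touching the matrices in a blocked order.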
High Performance Fortran Compilation Techniques for Parallelizing Scientific Codes
, 1998
Abstract

Cited by 33 (18 self)
With current compilers for High Performance Fortran (HPF), substantial restructuring and hand-optimization may be required to obtain acceptable performance from an HPF port of an existing Fortran application. A key goal of the Rice dHPF compiler project is to develop optimization techniques that can provide consistently high performance for a broad spectrum of scientific applications with minimal restructuring of existing Fortran 77 or Fortran 90 applications. This paper presents four new optimization techniques we developed to support efficient parallelization of codes with minimal restructuring. These optimizations include computation partition selection for loop nests that use privatizable arrays, along with partial replication of boundary computations to reduce communication overhead; communication-sensitive loop distribution to eliminate inner-loop communications; interprocedural selection of computation partitions; and data availability analysis to eliminate redundant communication ...
Implementation of NAS Parallel Benchmarks in High Performance Fortran
Abstract

Cited by 32 (4 self)
We present an HPF implementation of BT, SP, LU, FT, CG, and MG from the NPB2.3-serial benchmark set. The implementation is based on an HPF performance model of the benchmark-specific primitive operations with distributed arrays. We present profiling and performance data on the SGI Origin 2000 and compare the results with NPB2.3. We discuss the advantages and limitations of HPF and the pghpf compiler.
Program Locality Analysis Using Reuse Distance
, 2009
Abstract

Cited by 27 (12 self)
On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance—the number of distinct locations accessed between consecutive accesses to a given location. This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input. The article presents two techniques that predict how the locality of a program changes with its input. The first is approximate reuse-distance measurement, which is asymptotically faster than exact methods while providing a guaranteed precision. The second is statistical prediction of locality in all executions of a program based on the analysis of a few executions. The prediction process has three steps: dividing data accesses into groups, finding the access patterns in each group, and building parameterized models. The resulting prediction may be used online with the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new techniques predicted program locality with good accuracy, even for test executions that are orders of magnitude larger than the training executions. The two techniques are among the first to enable quantitative analysis of whole-program locality and ...
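For concreteness, an exact (naive) reuse-distance computation over a small address trace might look like the sketch below. This is an illustration of the definition only, not the paper's approximate algorithm, which obtains the same measure asymptotically faster with bounded error:

```python
def reuse_distances(trace):
    # Exact reuse distance: for each access, the number of DISTINCT
    # addresses touched since the previous access to the same address.
    # First-time accesses get distance "inf" (a cold miss).
    # Note: this naive version is O(n^2); production tools use a
    # balanced-tree structure to count distinct elements efficiently.
    last_pos = {}  # address -> index of its previous access
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # Distinct addresses strictly between the two accesses.
            window = set(trace[last_pos[addr] + 1 : i])
            dists.append(len(window))
        else:
            dists.append(float("inf"))
        last_pos[addr] = i
    return dists
```

For the trace a b c a b, the second access to a has reuse distance 2 (b and c were touched in between), as does the second access to b.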
Increasing temporal locality with skewing and recursive blocking
 In Proc. SC2001
, 2001
Abstract

Cited by 26 (2 self)
We present a strategy, called recursive prismatic time skewing, that increases temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions the iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include: multidimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bidirectional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works interprocedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at all levels in a memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to original codes and to methods described previously in the literature. With an interprocedural application of our techniques, we were able to reduce the total primary cache misses of a large application code by 27% and secondary cache misses by 119%.
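As a much-simplified illustration of the skewing concept (spatial skewing of a wavefront dependence only; the paper's prismatic time skewing additionally skews along the time/convergence dimension), the two traversals below compute identical results while visiting the iteration space in different orders:

```python
def wavefront_original(n):
    # Row-major traversal: A[i][j] depends on its north and west neighbors.
    A = [[1] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A

def wavefront_skewed(n):
    # Skewed traversal: visit anti-diagonals w = i + j in order.
    # Every point on a diagonal depends only on earlier diagonals, so a
    # diagonal's points could be tiled or executed in parallel.
    A = [[1] * n for _ in range(n)]
    for w in range(2, 2 * n - 1):
        for i in range(max(1, w - n + 1), min(n, w)):
            j = w - i
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A
```

Both functions fill the same table; only the schedule of the iteration space changes, which is the degree of freedom that skewing-based locality transformations exploit.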
Application Representations for Multi-Paradigm Performance Modeling of Large-Scale Parallel Scientific Codes
 INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS
, 2000
Abstract

Cited by 26 (8 self)
Effective performance prediction for large parallel applications on very large-scale systems requires a comprehensive modeling approach that combines analytical models, simulation models, and measurement for different application and system components. This paper presents a common parallel program representation designed to support such a comprehensive approach, with four design goals: (1) the representation must support a wide range of modeling techniques; (2) it must be automatically computable using parallelizing compiler technology, in order to minimize the need for user intervention; (3) it must be efficient and scalable enough to model teraflop-scale applications; and (4) it should be flexible enough to capture the performance impact of changes to the application, including changes to the parallelization strategy, communication, and scheduling. The representation we present is based on a combination of static and dynamic task graphs. It exploits recent compiler advances that make it possible to use concise, symbolic static graphs and to instantiate dynamic graphs. This representation has led to the development of a compiler-supported simulation approach that can simulate regular, message-passing programs on systems or problems 10–100 times larger than was possible with previous state-of-the-art simulation techniques.
POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems
 In Proceedings of First International Workshop on Software and Performance (WOSP
, 1998
Abstract

Cited by 22 (10 self)
The POEMS project is creating an environment for end-to-end performance modeling of complex parallel and distributed systems, spanning the domains of application software, runtime and operating system software, and hardware architecture. To enable end-to-end modeling of large-scale applications and systems, the POEMS framework is designed to compose models of system components from these different domains, to integrate multiple modeling paradigms (analytical modeling, simulation, and actual system execution), and to allow different components to be modeled at multiple levels of detail. The key components of the POEMS framework include a generalized task graph model for describing parallel computations, automatic generation of the task graph by a parallelizing compiler, a specification language for mapping the computation on models for operating system and hardware components, a library of analytical and simulation models for components from the different domains, and a knowledge base d...
Generalized multipartitioning for multidimensional arrays
 In Proceedings of the International Parallel and Distributed Processing Symposium, Fort Lauderdale, FL
, 2002
Abstract

Cited by 18 (2 self)
Multipartitioning is a strategy for parallelizing computations that require solving 1D recurrences along each dimension of a multidimensional array. Previous techniques for multipartitioning yield efficient parallelizations over 3D domains only when the number of processors is a perfect square. This paper considers the general problem of computing multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning onto all of the processors for this general case. Finally, we describe how we extended the Rice dHPF compiler for High Performance Fortran to generate code that exploits generalized multipartitioning and show that the compiler’s generated code for the NAS SP computational fluid dynamics benchmark achieves scalable high performance.
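The classic square case (p processors, a p x p grid of tiles) can be sketched as a diagonal assignment; the paper's contribution generalizes this idea to d dimensions and arbitrary processor counts:

```python
def diagonal_multipartition(p):
    # Classic 2-D multipartitioning for p processors: tile (i, j) of a
    # p x p tile grid is assigned to processor (j - i) mod p. Each
    # processor then owns exactly one tile in every row and every
    # column, so a line sweep along EITHER array dimension keeps all
    # p processors busy with no idling.
    return [[(j - i) % p for j in range(p)] for i in range(p)]
```

The balance property (one tile per processor per row and per column) is exactly what makes multipartitioning attractive for alternating-direction solvers such as NAS SP.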
Compiler support for exploiting coarse-grained pipelined parallelism
 In Supercomputing
, 2003
Abstract

Cited by 16 (2 self)
The emergence of the grid and a new class of data-driven applications is making a new form of parallelism desirable, which we refer to as coarse-grained pipelined parallelism. This paper reports on a compilation system developed to exploit this form of parallelism. We use a dialect of Java that exposes both pipelined and data parallelism to the compiler. Our compiler is responsible for selecting a set of candidate filter boundaries, determining the volume of communication required if a particular boundary is chosen, performing the decomposition, and generating code. We have developed a one-pass algorithm for determining the required communication between consecutive filters. We have developed a cost model for estimating the execution time for a given decomposition, and a dynamic programming algorithm for performing the decomposition. A detailed evaluation of our current compiler using four data-driven applications demonstrates the feasibility of our approach.
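The decomposition step can be sketched with a standard chain-partitioning dynamic program. This is a hypothetical simplification: it minimizes only the bottleneck filter's compute cost, whereas the paper's cost model also weighs the communication volume across each candidate boundary:

```python
import functools

def split_pipeline(costs, k):
    # Split a chain of stage costs into k consecutive filters so that
    # the most expensive filter (the pipeline's throughput bottleneck)
    # is as cheap as possible. Returns that minimal bottleneck cost.
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)  # prefix[j] = sum of costs[:j]

    @functools.lru_cache(maxsize=None)
    def best(i, parts):
        # Minimal bottleneck for costs[i:] split into `parts` filters.
        if parts == 1:
            return prefix[n] - prefix[i]
        # Choose the next cut point j, leaving enough stages behind it.
        return min(max(prefix[j] - prefix[i], best(j, parts - 1))
                   for j in range(i + 1, n - parts + 2))

    return best(0, k)
```

For example, splitting stage costs [1, 2, 3, 4, 5] into two filters cuts after the fourth stage, giving filters of cost 10 and 5... no, of cost 6 and 9, whose bottleneck of 9 is the minimum achievable.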