Results 1 - 10
of
12
Using Integer Sets for Data-Parallel Program Analysis and Optimization
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
"... In this paper, we describe our experience with using an abstract integer-set framework to develop the Rice dHPF compiler, a compiler for High Performance Fortran. We present simple, yet general formulations of the major computation partitioning and communication analysis tasks as well as a number of ..."
Abstract
-
Cited by 55 (30 self)
- Add to MetaCart
In this paper, we describe our experience with using an abstract integer-set framework to develop the Rice dHPF compiler, a compiler for High Performance Fortran. We present simple, yet general formulations of the major computation partitioning and communication analysis tasks as well as a number of important optimizations in terms of abstract operations on sets of integer tuples. This approach has made it possible to implement a comprehensive collection of advanced optimizations in dHPF, and to do so in the context of a more general computation partitioning model than previous compilers. One potential limitation of the approach is that the underlying class of integer set problems is fundamentally unable to represent HPF data distributions on a symbolic number of processors. We describe how we extend the approach to compile codes for a symbolic number of processors, without requiring any changes to the set formulations for the above optimizations. We show experimentally that the set re...
A Comparison of Locality Transformations for Irregular Codes
, 2000
"... Researchers have proposed several data and computation transformations to improve locality in irregular scientific codes. We experimentally compare their performance and present GPART, a new technique based on hierarchical clustering. Quality partitions are constructed quickly by clustering multiple ..."
Abstract
-
Cited by 28 (8 self)
- Add to MetaCart
Researchers have proposed several data and computation transformations to improve locality in irregular scientific codes. We experimentally compare their performance and present GPART, a new technique based on hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes with priority on nodes with high degree, and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Experimental results show GPART matches the performance of more sophisticated partitioning algorithms to with 6%–8%, with a small fraction of the overhead. It is thus useful for optimizing programs whose running times are not known.
Improving Compiler and Run-Time Support for Adaptive Irregular Codes
- In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
, 1998
"... Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a ..."
Abstract
-
Cited by 28 (8 self)
- Add to MetaCart
Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique based on the owner-computes rule which eliminates the need for buffers or synchronized writes but may replicate computation. We evaluate its performance for irregular codes while varying connectivity, locality, and adaptivity. LOCALWRITE improves performance by 50--150% compared to using replicated buffers, and can match or exceed gather/scatter for applications with low locality or high adaptivity. 1 Introduction Scientists are beginning to exploit parallelism to provide the computing power they need for research and development. As they attempt to model more complex problems, irregular adaptive computations become increasingly important. The core of these applications is fre...
Improving Compiler and Run-Time Support for Irregular Reductions
, 1998
"... Compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems or relying on the sharedmemory interface supported by software DSMs. Run-time systems gather/scatter nonlocal results (e.g., CHAOS, PI-LAR) while software DS ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems or relying on the sharedmemory interface supported by software DSMs. Run-time systems gather/scatter nonlocal results (e.g., CHAOS, PI-LAR) while software DSMs apply local reductions to replicated buffers (e.g., CVM, TreadMarks). We introduce LO-CALWRITE, a new technique for parallelizing irregular reductions based on the owner-computes rule. It eliminates the need for buffers or synchronized writes, but may replicate computation. We investigate the impact of connectivity (node/edge ratio), locality (accesses to local data) and adaptivity (edge modifications) on their relative performance. LOCALWRITE improves performance by 50-150% compared to using replicated buffers. Gather/scatter using CHAOS generally provides the best performance, but LO-CALWRITE can outperform CHAOS for applications with low locality or high adaptivity. We also discover the flushupdate coherence protocol can improve performance by 15-25 % for software DSMs over an invalidate protocol.
Improving Locality for Adaptive Irregular Scientific Codes
- In 13 th Int'l Workshop on Languages and Compilers for Parallel Computing
, 1999
"... An important class of scientific codes access memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
An important class of scientific codes access memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can significantly reduce cache miss rates by increasing spatial locality. In this paper, we investigate techniques for using partitioning algorithms to improve locality in adaptive irregular codes. We develop parameters to guide both geometric (RCB) and graph partitioning (METIS) algorithms, and develop a new graph partitioning algorithm based on hierarchical clustering (GPART) which achieves good locality with low overhead. We also examine the effectiveness of locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We use a simple cost model to guide locality optimizations when access patterns change. Experiments on irregul...
Software Support For Improving Locality in Scientific Codes
- In Proc. CPC2000
, 2000
"... We propose to develop and evaluate software support for improving locality for advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative PDE solvers ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
We propose to develop and evaluate software support for improving locality for advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative PDE solvers for 3D partial differential equations have poor locality because accesses to nearby elements in higher-level dimensions are spread far apart in memory. Careful tiling and padding can frequently recapture such reuse. Second, computations on adaptive meshes and sparse matrices experience many cache misses because they access data in an irregular manner. Data layout and access order can be rearranged according to mesh connections or geometric location to improve locality, with cost models used to guide frequency of transformations for adaptive computations. 1 Introduction From astrophysics to biochemistry, scientists are increasingly using computers to conduct research and development. Fuelin...
Efficient Compiler and Run-Time Support for Parallel Irregular Reductions
- Parallel Computing
, 2000
"... Many scientific applications are comprised of irregular reductions on large data sets. In shared-memory parallel programs, these irregular reductions are typically computed in parallel using replicated buffers, then combined using synchronization. We develop LocalWrite, a new technique which partiti ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Many scientific applications are comprised of irregular reductions on large data sets. In shared-memory parallel programs, these irregular reductions are typically computed in parallel using replicated buffers, then combined using synchronization. We develop LocalWrite, a new technique which partitions irregular reductions so that each processor computes values only for locally assigned data, eliminating the need for buffers or synchronized writes. Computation is replicated if its results are needed on multiple processors. We experimentally evaluate its performance for three irregular codes on a software DSM running on a distributed-memory multiprocessor and two shared-memory multiprocessors while varying connectivity, locality, and adaptivity. Results show LocalWrite improves performance significantly compared to using replicated buffers, and can match or exceed explicit message-passing gather/scatter for applications with low locality or high adaptivity. Keywords: parallelizing comp...
High-level language support for user-defined reductions
- JOURNAL OF SUPERCOMPUTING
, 2002
"... The optimized handling of reductions on parallel supercomputers or clusters of workstations is critical to high performance because reductions are common in scientific codes and a potential source of bottlenecks. Yet in many high-level languages, a mechanism for writing efficient reductions remain ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
The optimized handling of reductions on parallel supercomputers or clusters of workstations is critical to high performance because reductions are common in scientific codes and a potential source of bottlenecks. Yet in many high-level languages, a mechanism for writing efficient reductions remains surprisingly absent. Further, when such mechanisms do exist, they often do not provide the flexibility a programmer needs to achieve a desirable level of performance. In this paper, we present a new language construct for arbitrary reductions that lets a programmer achieve a level of performance equal to that achievable with the highly flexible, but low-level combination of Fortran and MPI. We have implemented this construct in the ZPL language and evaluate it in the context of the initialization of the NAS MG benchmark. We show a 45 times speedup over the same code written in ZPL without this construct. In addition, performance on a large number of processors surpasses that achieved in the NAS implementation showing that our mechanism provides programmers with the needed flexibility.
Compiler and Runtime Analysis for Efficient Communication in Data Intensive Applications
"... Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We are developing a compiler that processes data intensive applications written in a dialect of Java and compiles them for efficient execution on distributed memory parallel ma ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We are developing a compiler that processes data intensive applications written in a dialect of Java and compiles them for efficient execution on distributed memory parallel machines.
Design and Evaluation of a Computation Partitioning Framework for Data-Parallel Compilers
, 2001
"... this paper, we present the design and evaluation of a flexible computation partitioning framework used in the Rice dHPF compiler for High Performance Fortran. Our CP framework supports a more general class of static computation partitionings than previous data-parallel compilers, enables sophisticat ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
this paper, we present the design and evaluation of a flexible computation partitioning framework used in the Rice dHPF compiler for High Performance Fortran. Our CP framework supports a more general class of static computation partitionings than previous data-parallel compilers, enables sophisticated partitionings that maximize parallelism in the presence of arbitrary control flow, and supports several novel optimizations that have proven essential for obtaining high overall performance when parallelizing scientific programs. In earlier work, we have shown that the dHPF compiler is able to effectively parallelize HPF versions of existing Fortran codes and achieve speedups that are comparable with hand-coded parallel performance [2, 1]. For example, code generated by dHPF for the NAS application benchmarks SP and BT is within 0--21% of the performance of sophisticated hand-coded message-passing versions of the codes, and these results are achieved with HPF versions that require changes to fewer than 6% of the lines of the original serial codes. Three new CP-based optimizations in the dHPF compiler were key to achieving this level of performance. Two of these three optimizations, along with another new algorithm presented in this paper, require the full generality of our CP framework and could not be implemented in any other existing compiler that we are aware of. In data distribution-based languages such as Fortran D [13], Vienna Fortran [11], and High Performance Fortran (HPF) [20], it is natural to represent a CP for a statement instance as the set of processor(s) that "own" a particular array element or scalar variable (by virtue of a data distribution). For example, the widely-used owner-computes rule [25] (a simple heuristic for computation partitioning selection) ...

