Results 11 - 20 of 55
A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations, 1995
"... ..."
Improving Compiler and Run-Time Support for Irregular Reductions, 1998
"... Compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems or relying on the sharedmemory interface supported by software DSMs. Run-time systems gather/scatter nonlocal results (e.g., CHAOS, PI-LAR) while software DS ..."
Abstract - Cited by 21 (1 self)
Compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems or by relying on the shared-memory interface supported by software DSMs. Run-time systems gather/scatter nonlocal results (e.g., CHAOS, PILAR) while software DSMs apply local reductions to replicated buffers (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique for parallelizing irregular reductions based on the owner-computes rule. It eliminates the need for buffers or synchronized writes, but may replicate computation. We investigate the impact of connectivity (node/edge ratio), locality (accesses to local data), and adaptivity (edge modifications) on their relative performance. LOCALWRITE improves performance by 50-150% compared to using replicated buffers. Gather/scatter using CHAOS generally provides the best performance, but LOCALWRITE can outperform CHAOS for applications with low locality or high adaptivity. We also find that the flush-update coherence protocol can improve performance by 15-25% for software DSMs over an invalidate protocol.
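To make the owner-computes idea concrete, here is a minimal C sketch of a LOCALWRITE-style irregular reduction over an edge list, assuming a node-ownership map; the names (edges, owner, force) and the two-endpoint update are illustrative, not taken from the paper.

```c
#define NNODES 1000
#define NEDGES 4000

typedef struct { int u, v; } edge_t;

extern edge_t edges[NEDGES];        /* mesh connectivity */
extern double x[NNODES], y[NNODES]; /* input values, reduction target */
extern int    owner[NNODES];        /* owning process of each node (assumed) */

static double force(double a, double b) { return 0.5 * (a - b); }

/* Each process applies an edge's contribution only to the endpoints it
 * owns, so every write is local: no replicated buffers and no
 * synchronized writes.  An edge whose endpoints have two different
 * owners is visited by both, replicating the edge computation (the
 * LOCALWRITE tradeoff described above). */
void localwrite_reduction(int me)
{
    for (int e = 0; e < NEDGES; e++) {
        int u = edges[e].u, v = edges[e].v;
        if (owner[u] != me && owner[v] != me)
            continue;                    /* edge touches no local node */
        double f = force(x[u], x[v]);    /* possibly recomputed by two owners */
        if (owner[u] == me) y[u] += f;   /* local write only */
        if (owner[v] == me) y[v] -= f;   /* local write only */
    }
}
```

A gather/scatter scheme in the CHAOS style would instead let each process compute all of its edges once and exchange the nonlocal y entries afterward.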
Improving Locality For Adaptive Irregular Scientific Codes, 1999
"... An important class of scientific codes access memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can ..."
Abstract - Cited by 21 (2 self)
An important class of scientific codes accesses memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can significantly reduce cache miss rates by increasing spatial locality. In this paper, we investigate techniques for using partitioning algorithms to improve locality in adaptive irregular codes. We develop parameters to guide both geometric (RCB) and graph partitioning (METIS) algorithms, and develop a new graph partitioning algorithm based on hierarchical clustering (GPART) which achieves good locality with low overhead. We also examine the effectiveness of locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We use a simple cost model to guide locality optimizations when access patterns change. Experiments on irregular scientific codes for a variety of meshes show our partitioning algorithms are effective for static and adaptive codes on both sequential and parallel machines. Improved locality also enhances the effectiveness of LOCALWRITE, a parallelization technique for irregular reductions based on the owner-computes rule.
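As a concrete illustration of consecutive packing by traversal order (the baseline the partitioning algorithms build on), here is a hedged C sketch of first-touch packing; the function and array names are hypothetical, and GPART's hierarchical clustering itself is not shown.

```c
typedef struct { int u, v; } edge_t;

/* Pack node data consecutively in the order the edge list first
 * touches each node, so nodes used together sit together in memory. */
void pack_first_touch(const edge_t *edges, int nedges,
                      const double *x, double *x_packed,
                      edge_t *edges_new, int *perm, int nnodes)
{
    int next = 0;
    for (int i = 0; i < nnodes; i++)
        perm[i] = -1;
    for (int e = 0; e < nedges; e++) {      /* first-touch numbering */
        if (perm[edges[e].u] < 0) perm[edges[e].u] = next++;
        if (perm[edges[e].v] < 0) perm[edges[e].v] = next++;
    }
    for (int i = 0; i < nnodes; i++)        /* untouched nodes go last */
        if (perm[i] < 0) perm[i] = next++;
    for (int i = 0; i < nnodes; i++)        /* gather into new layout */
        x_packed[perm[i]] = x[i];
    for (int e = 0; e < nedges; e++) {      /* renumber connectivity */
        edges_new[e].u = perm[edges[e].u];
        edges_new[e].v = perm[edges[e].v];
    }
}
```

Note that the index arrays must be remapped along with the data, which is why the sketch emits a renumbered edge list.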
Interprocedural Compilation of Fortran D, 1996
"... Fortran D is a version of Fortran extended with data decomposition specifications. It is designed to provide a machine-independent programming model for data-parallel applications and has heavily influenced the design of High Performance Fortran (HPF). In previous work we described Fortran D compila ..."
Abstract - Cited by 20 (2 self)
Fortran D is a version of Fortran extended with data decomposition specifications. It is designed to provide a machine-independent programming model for data-parallel applications and has heavily influenced the design of High Performance Fortran (HPF). In previous work we described Fortran D compilation algorithms for individual procedures. This paper presents an interprocedural approach to analyze data and computation partitions, optimize communication, support dynamic data decomposition, and perform the other tasks required to compile Fortran D programs. Our algorithms are designed to make interprocedural compilation efficient. First, we collect summary information after edits to solve important data-flow problems in a separate interprocedural propagation phase. Second, for non-recursive programs we compile procedures in reverse topological order to propagate additional interprocedural information during code generation. We thus limit compilation to a single pass over each procedure body. ...
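A reverse topological order over a non-recursive call graph is just a postorder walk; the following C sketch shows the traversal shape. The adjacency-matrix call graph and the compile hook are illustrative, not the Fortran D compiler's actual structures, and which direction information flows depends on the analysis.

```c
#define NPROC 4

/* calls[i][j] != 0 iff procedure i contains a call to procedure j;
 * here main(0) calls 1 and 2, and both call 3 (a made-up example). */
static const int calls[NPROC][NPROC] = {
    {0, 1, 1, 0},
    {0, 0, 0, 1},
    {0, 0, 0, 1},
    {0, 0, 0, 0},
};
static int done[NPROC];

/* Postorder DFS over caller->callee edges emits every procedure after
 * all of its callees: a reverse topological order of the acyclic
 * (non-recursive) call graph, so each body is visited exactly once
 * with callee summaries already available. */
static void compile_postorder(int p)
{
    done[p] = 1;
    for (int q = 0; q < NPROC; q++)
        if (calls[p][q] && !done[q])
            compile_postorder(q);
    /* compile_procedure(p);  -- placeholder for per-procedure codegen */
}
```

Driving this with `compile_postorder(0)` (or a loop over all not-yet-done roots) gives the single pass over each procedure body that the abstract describes.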
Automatic Mapping of Task and Data Parallel Programs for Efficient Execution on Multicomputers, 1993
"... For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploitingtask and data parallelism in a single compiler framework, and such a compiler can map a single ..."
Abstract - Cited by 15 (4 self)
For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. There are several complex tradeoffs between task and data parallelism, depending on the characteristics of the program to be executed and the performance parameters of the target parallel machine. This makes it very difficult for a programmer to select a good mapping for a task and data parallel program. In this paper we isolate and examine specific characteristics of executing programs that determine the performance for different mappings on a parallel machine, and present an automatic system to obtain good mappings. The process consists of two steps: First, an instrumented input program is executed a fixed number of times with ...
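A toy version of the mapping decision might look like the following C sketch, which picks a processor count for one task from measured work and communication volumes. The cost model here is invented for illustration and is far simpler than the paper's system.

```c
#include <math.h>

/* Toy cost model (assumption, not the paper's): time of one
 * data-parallel task on p processors is compute that scales down
 * plus communication that grows with p. */
static double est_time(double work, double comm, int p)
{
    return work / p + comm * log2((double)p + 1.0);
}

/* Choose a processor count (powers of two up to pmax) minimizing the
 * estimated time, given profiling measurements for this task. */
int choose_procs(double work, double comm, int pmax)
{
    int best = 1;
    double tbest = est_time(work, comm, 1);
    for (int p = 2; p <= pmax; p *= 2) {
        double t = est_time(work, comm, p);
        if (t < tbest) { tbest = t; best = p; }
    }
    return best;
}
```

Running such a model over every task in the program, subject to the total machine size, is one way to turn profiled characteristics into a task/data mapping.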
Software Support For Improving Locality in Scientific Codes, 2000
"... We propose to develop and evaluate software support for improving locality for advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative PDE solvers ..."
Abstract - Cited by 15 (1 self)
We propose to develop and evaluate software support for improving locality in advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative solvers for 3D partial differential equations have poor locality because accesses to nearby elements in higher dimensions are spread far apart in memory; careful tiling and padding can frequently recapture such reuse. Second, computations on adaptive meshes and sparse matrices experience many cache misses because they access data in an irregular manner; data layout and access order can be rearranged according to mesh connections or geometric location to improve locality, with cost models used to guide how often transformations are applied in adaptive computations.
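For the first area, here is a hedged C sketch of tiling plus padding applied to a 3D Jacobi-style sweep; the tile size and padding amount are placeholders that would in practice be tuned per machine.

```c
#define N    128
#define PAD  8     /* inter-row padding; the right amount is machine-specific */
#define TILE 16    /* placeholder tile size, likewise machine-specific */

static double a[N][N][N + PAD], b[N][N][N + PAD];

/* Tiling the j/k loops keeps the working set of neighboring planes in
 * cache across the i sweep; padding the innermost dimension breaks the
 * power-of-two strides that cause conflict misses between planes. */
void sweep_tiled(void)
{
    for (int jj = 1; jj < N - 1; jj += TILE)
        for (int kk = 1; kk < N - 1; kk += TILE)
            for (int i = 1; i < N - 1; i++)
                for (int j = jj; j < jj + TILE && j < N - 1; j++)
                    for (int k = kk; k < kk + TILE && k < N - 1; k++)
                        b[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                                      a[i][j-1][k] + a[i][j+1][k] +
                                      a[i][j][k-1] + a[i][j][k+1]) / 6.0;
}
```

The second area, irregular data reordering, follows the packing and partitioning approach sketched for the adaptive-locality paper above.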
Efficient compiler and run-time support for parallel irregular reductions, 2000
"... ..."
(Show Context)
Experimental Analysis of Parallel Systems: Techniques and Open Problems, Lect. Notes in Comp. Sci. 794, 1994
"... . Massively parallel systems pose daunting performance instrumentation and data analysis problems. Balancing instrumentation detail, application perturbation, data reduction costs, and presentation complexity requires a mix of science, engineering, and art. This paper surveys current techniques for ..."
Abstract - Cited by 13 (0 self)
Massively parallel systems pose daunting performance instrumentation and data analysis problems. Balancing instrumentation detail, application perturbation, data reduction costs, and presentation complexity requires a mix of science, engineering, and art. This paper surveys current techniques for performance instrumentation and data presentation, illustrates one approach to tool extensibility, and discusses the implications of massive parallelism for performance analysis environments.

1 Introduction

"The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible." (Charles Babbage)

In the past one hundred and fifty years, little has changed since Babbage's remark. Performance optimization remains a difficult and elusive goal. And as we move from vector supercomputers to parallel systems that scale from tens to thousands of processors, many of the performance instrumentation, d...
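As a small illustration of the detail-versus-perturbation tradeoff the survey discusses, here is a hedged C sketch of buffered event tracing; the API and record format are invented for illustration.

```c
#include <stdio.h>
#include <time.h>

/* Buffering trace records in memory and flushing after the run keeps
 * per-event perturbation small, at the cost of memory and of dropping
 * events once the buffer fills: one point in the tradeoff space. */
typedef struct { double t; int event; } trace_rec;

#define TRACE_CAP (1 << 20)
static trace_rec trace_buf[TRACE_CAP];
static long trace_n;

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  /* POSIX clock */
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

void trace_event(int event)
{
    if (trace_n < TRACE_CAP) {
        trace_buf[trace_n].t = now_sec();
        trace_buf[trace_n].event = event;
        trace_n++;
    }
}

void trace_flush(FILE *out)
{
    for (long i = 0; i < trace_n; i++)
        fprintf(out, "%.9f %d\n", trace_buf[i].t, trace_buf[i].event);
}
```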
Advanced Code Generation for High Performance Fortran, in Languages, Compilation Techniques and Run Time Systems for Scalable Parallel Systems, Lecture Notes in Computer Science Series
"... this paper, we describe techniques developed in the Rice dHPF compiler to address key code generation challenges that arise in achieving high performance for regular applications on message-passing systems. We focus on techniques required to implement advanced optimizations and to achieve consistent ..."
Abstract - Cited by 13 (2 self)
In this paper, we describe techniques developed in the Rice dHPF compiler to address key code generation challenges that arise in achieving high performance for regular applications on message-passing systems. We focus on techniques required to implement advanced optimizations and to achieve consistently high performance with existing optimizations. Many of the core communication analysis and code generation algorithms in dHPF are expressed in terms of abstract equations manipulating integer sets. This approach enables general yet simple implementations of sophisticated optimizations, making it more practical to include a comprehensive set of optimizations in data-parallel compilers. It also enables the compiler to support much more aggressive computation partitioning algorithms than previous compilers. We therefore believe this approach can provide higher and more consistent levels of performance than are available today.
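The "abstract equations manipulating integer sets" can be pictured with a simple shift communication on a BLOCK-distributed array; the LaTeX formulation below is illustrative of the style, not dHPF's actual equations (B is an assumed block size, Iter(p) the iterations assigned to process p).

```latex
% Illustrative receive set for the reference a(i-1) under a BLOCK
% distribution: data process p reads but process q owns.
\begin{align*}
\mathrm{Own}(p)    &= \{\, i \mid Bp \le i < B(p+1) \,\} \\
\mathrm{Ref}(p)    &= \{\, i - 1 \mid i \in \mathrm{Iter}(p) \,\} \\
\mathrm{Recv}(p,q) &= \mathrm{Ref}(p) \cap \mathrm{Own}(q), \qquad q \ne p
\end{align*}
```

Composing and simplifying such sets mechanically (for message coalescing, computation partitioning, and the like) is what makes the set-based approach general yet simple to implement.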