Improving Compiler and Run-Time Support for Irregular Reductions
1998
Cited by 17 (1 self)
Compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems or relying on the shared-memory interface supported by software DSMs. Run-time systems gather/scatter nonlocal results (e.g., CHAOS, PILAR) while software DSMs apply local reductions to replicated buffers (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique for parallelizing irregular reductions based on the owner-computes rule. It eliminates the need for buffers or synchronized writes, but may replicate computation. We investigate the impact of connectivity (node/edge ratio), locality (accesses to local data) and adaptivity (edge modifications) on their relative performance. LOCALWRITE improves performance by 50-150% compared to using replicated buffers. Gather/scatter using CHAOS generally provides the best performance, but LOCALWRITE can outperform CHAOS for applications with low locality or high adaptivity. We also discover the flush-update coherence protocol can improve performance by 15-25% for software DSMs over an invalidate protocol.
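The trade-off this abstract describes (replicated buffers versus owner-computes with possibly replicated computation) can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the paper's implementation: the function names, the cyclic assignment of edges to processors, and the block distribution of nodes are all hypothetical choices made for illustration.

```python
def reduce_replicated(num_nodes, edges, weights, num_procs):
    """Each simulated processor accumulates into a private full-size buffer;
    the buffers are combined in a final step."""
    buffers = [[0.0] * num_nodes for _ in range(num_procs)]
    for i, (u, v) in enumerate(edges):
        p = i % num_procs  # assumption: edges are dealt out cyclically
        buffers[p][u] += weights[i]
        buffers[p][v] += weights[i]
    return [sum(buf[n] for buf in buffers) for n in range(num_nodes)]


def owner(node, num_nodes, num_procs):
    """Assumption: nodes are block-distributed across processors."""
    block = -(-num_nodes // num_procs)  # ceiling division
    return node // block


def reduce_owner_computes(num_nodes, edges, weights, num_procs):
    """Each simulated processor executes every edge that touches a node it
    owns and writes only to its own nodes. An edge whose endpoints have
    different owners is therefore executed twice (replicated computation)."""
    result = [0.0] * num_nodes
    executed = 0
    for p in range(num_procs):
        for i, (u, v) in enumerate(edges):
            owns_u = owner(u, num_nodes, num_procs) == p
            owns_v = owner(v, num_nodes, num_procs) == p
            if not (owns_u or owns_v):
                continue
            executed += 1
            if owns_u:
                result[u] += weights[i]
            if owns_v:
                result[v] += weights[i]
    return result, executed


edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
weights = [1.0, 2.0, 3.0, 4.0]
print(reduce_replicated(4, edges, weights, 2))
print(reduce_owner_computes(4, edges, weights, 2))
```

In this toy mesh, two of the four edges cross the ownership boundary, so the owner-computes variant executes six edge iterations instead of four but needs no per-processor buffers and no combining step.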
Improving Locality for Adaptive Irregular Scientific Codes
In 13th Int'l Workshop on Languages and Compilers for Parallel Computing, 1999
Cited by 14 (2 self)
An important class of scientific codes accesses memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can significantly reduce cache miss rates by increasing spatial locality. In this paper, we investigate techniques for using partitioning algorithms to improve locality in adaptive irregular codes. We develop parameters to guide both geometric (RCB) and graph partitioning (METIS) algorithms, and develop a new graph partitioning algorithm based on hierarchical clustering (GPART) which achieves good locality with low overhead. We also examine the effectiveness of locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We use a simple cost model to guide locality optimizations when access patterns change. Experiments on irregul...
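The "consecutive packing relative to traversal order" that this abstract cites as prior work can be sketched as a first-touch renumbering of nodes. This is a minimal illustration, not the RCB, METIS, or GPART approaches the paper studies; the function names and the edge-list representation are assumptions.

```python
def first_touch_order(num_nodes, edges):
    """Map each old node index to a new one, in the order nodes are first
    touched while traversing the edge list."""
    new_index = [-1] * num_nodes
    next_slot = 0
    for u, v in edges:
        for n in (u, v):
            if new_index[n] == -1:
                new_index[n] = next_slot
                next_slot += 1
    # Nodes never referenced by an edge are placed at the end.
    for n in range(num_nodes):
        if new_index[n] == -1:
            new_index[n] = next_slot
            next_slot += 1
    return new_index


def apply_packing(data, edges, new_index):
    """Move node data to its packed position and rewrite edges to match."""
    packed = [None] * len(data)
    for old, new in enumerate(new_index):
        packed[new] = data[old]
    new_edges = [(new_index[u], new_index[v]) for u, v in edges]
    return packed, new_edges


edges = [(3, 0), (0, 2), (2, 3)]
data = ["a", "b", "c", "d", "e"]
order = first_touch_order(len(data), edges)
print(order)
print(apply_packing(data, edges, order))
```

After packing, the traversal touches node data at consecutive positions, which is the spatial-locality effect the abstract refers to. For adaptive codes the edge list changes over time, so the renumbering would have to be reapplied at some interval.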
Software Support For Improving Locality in Scientific Codes
In Proc. CPC2000, 2000
Cited by 13 (1 self)
We propose to develop and evaluate software support for improving locality for advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative PDE solvers for 3D partial differential equations have poor locality because accesses to nearby elements in higher-level dimensions are spread far apart in memory. Careful tiling and padding can frequently recapture such reuse. Second, computations on adaptive meshes and sparse matrices experience many cache misses because they access data in an irregular manner. Data layout and access order can be rearranged according to mesh connections or geometric location to improve locality, with cost models used to guide frequency of transformations for adaptive computations.
1 Introduction
From astrophysics to biochemistry, scientists are increasingly using computers to conduct research and development. Fuelin...
Software Support For Improving Locality in Advanced Scientific Codes
2000
Cited by 5 (0 self)
Programs can achieve good performance only if they possess data locality. This paper describes our proposal to develop and evaluate software support for improving locality for advanced scientific applications for both sequential and parallel machines. The basic premise is that both compile-time analyses and sophisticated run-time systems are necessary. Run-time systems are needed because many programs are not analyzable statically. Compiler support is crucial both for inserting interfaces to the run-time system and for directly applying program transformations where possible. We examine locality optimizations needed for three features of advanced scientific applications (3D arrays, irregular accesses, and pointers). Preliminary experimental evaluation is very encouraging, but much work remains to automate and improve these compiler and run-time systems. We propose to extend locality optimizations in several directions, handling: cache conflicts between multiple data, deep memory hierar...
Automatic Analytical Modeling for the Estimation of Cache Misses
In International Conference on Parallel Architectures and Compilation Techniques, 1999
Cited by 1 (0 self)
Caches play a very important role in the performance of modern computer systems due to the gap between memory and processor speed. Among the methods for studying their behavior, the most widely used so far has been trace-driven simulation. Nevertheless, analytical modeling gives more information and requires less computation time, which allows it to be used in the compilation step to drive automatic optimizations on the code. The traditional drawbacks of analytical modeling have been its limited precision and the lack of techniques to apply it systematically without user intervention. In this work we present a methodology to build analytical models for codes with regular access patterns. These models can be applied to caches with an arbitrary size, line size and associativity. Their validation through simulations using typical scientific code fragments has shown a good degree of accuracy.
A Research Statement
fast processor technologies. I am interested in parallel architectures, both distributed and shared memory machines, the software involved and the challenges faced by these parallel processors. Thesis Summary: My thesis, titled "Design and Evaluation of a Uniform Compilation Framework for Hybrid Applications", is in the context of the PARADIGM compiler [BCG+95] for distributed memory message-passing multi-computers. The goal of PARADIGM is to develop automatic means to parallelize sequential programs and generate code for efficient execution on distributed memory machines. Our compiler and run-time framework tries to provide efficient solutions for problems including static and dynamic distribution of data and computation onto multiple processors, both task and data parallelism, reduction of inter-processor communication, and support for irregular computations [LCB] with minimum input from the user. The goal of our syst

