Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Run Time
- In Proceedings of the SIGPLAN ’99 Conference on Programming Language Design and Implementation, 1999
Cited by 83 (18 self)
With the rapid improvement of processor speed, performance of the memory hierarchy has become the principal bottleneck for most applications. A number of compiler transformations have been developed to improve data reuse in cache and registers, thus reducing the total number of direct memory accesses in a program. Until now, however, most data reuse transformations have been static---applied only at compile time. As a result, these transformations cannot be used to optimize irregular and dynamic applications, in which the data layout and data access patterns remain unknown until run time and may even change during the computation. In this paper, we explore ways to achieve better data reuse in irregular and dynamic applications by building on the inspector-executor method used by Saltz for run-time parallelization. In particular, we present and evaluate a dynamic approach for improving both computation and data locality in irregular programs. Our results demonstrate that run-time progra...
Improving effective bandwidth through compiler enhancement of global cache reuse
- In Proceedings of International Parallel and Distributed Processing Symposium, 2001
Cited by 62 (17 self)
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy
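The two-step strategy named in this abstract (fuse computations on the same data, then group data used by the same computation) can be illustrated with a small sketch. This is not the dissertation's compiler transformation: the function names and the toy computation are hypothetical, and Python lists only show the shape of the rewrite, not its cache behavior.

```python
def separate_loops(a, b):
    """Baseline: two passes, so a[i] is reused far apart in time."""
    c = [a[i] + b[i] for i in range(len(a))]         # pass 1 reads a and b
    return sum(c[i] * a[i] for i in range(len(a)))   # pass 2 re-reads a and c

def fused_loop(a, b):
    """Step 1 (fusion): both computations on a[i] happen in one iteration."""
    total = 0.0
    for i in range(len(a)):
        c_i = a[i] + b[i]      # the temporary array c is no longer needed
        total += c_i * a[i]
    return total

def fused_grouped(ab):
    """Step 2 (data grouping): a[i] and b[i] are stored side by side.

    `ab` is an assumed interleaved layout: a list of (a_i, b_i) pairs.
    """
    total = 0.0
    for a_i, b_i in ab:
        total += (a_i + b_i) * a_i
    return total
```

All three functions compute the same value; only the order of accesses and the layout of the data differ.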
Adaptive Reduction Parallelization Techniques
- In Proceedings of the 2000 International Conference on Supercomputing, 2000
Cited by 27 (4 self)
In this paper, we propose to adapt parallelizing transformations, more specifically, reduction parallelizations, to the actual reference pattern executed by a loop, i.e., to the particular input data and dynamic phase of a program. More precisely we will show how, after validating a reduction at run-time (when this is not possible at compile time) we can dynamically characterize its reference pattern and choose the most appropriate method for parallelizing it. For this purpose, we develop a library of parallel reduction algorithms, including both previously known and novel schemes, which includes algorithms specialized for different classes of access behavior. In particular, each algorithm in our library has identified strengths related to specific reference pattern characteristics, which are matched, at run-time, with measured characteristics of the actual reference pattern. The matching of algorithm to reference pattern is performed using a decision-tree based selection scheme. The c...
Improving Locality for Adaptive Irregular Scientific Codes
- In 13th Int'l Workshop on Languages and Compilers for Parallel Computing, 1999
Cited by 14 (2 self)
An important class of scientific codes access memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can significantly reduce cache miss rates by increasing spatial locality. In this paper, we investigate techniques for using partitioning algorithms to improve locality in adaptive irregular codes. We develop parameters to guide both geometric (RCB) and graph partitioning (METIS) algorithms, and develop a new graph partitioning algorithm based on hierarchical clustering (GPART) which achieves good locality with low overhead. We also examine the effectiveness of locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We use a simple cost model to guide locality optimizations when access patterns change. Experiments on irregul...
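The idea of consecutively packing data relative to traversal order can be sketched as below. This is a minimal illustration assuming a first-touch policy and an edge-list mesh; the function names and the stand-in kernel are hypothetical, and the partitioning algorithms named in the abstract (RCB, METIS, GPART) are not shown.

```python
def first_touch_order(edges, num_nodes):
    """Inspector: number nodes in the order the edge traversal first touches them.

    Returns new_index, where new_index[old] is the node's packed position.
    """
    new_index = [-1] * num_nodes
    next_slot = 0
    for u, v in edges:
        for node in (u, v):
            if new_index[node] == -1:
                new_index[node] = next_slot
                next_slot += 1
    # Nodes the traversal never touches are placed at the end.
    for node in range(num_nodes):
        if new_index[node] == -1:
            new_index[node] = next_slot
            next_slot += 1
    return new_index

def apply_packing(values, edges, new_index):
    """Move node data to packed positions and rewrite the edge list to match."""
    packed = [None] * len(values)
    for old, new in enumerate(new_index):
        packed[new] = values[old]
    remapped = [(new_index[u], new_index[v]) for u, v in edges]
    return packed, remapped

def edge_sum(values, edges):
    """Executor: a stand-in irregular kernel that reads both endpoints of each edge."""
    return sum(values[u] * values[v] for u, v in edges)
```

After packing, the kernel visits node data in mostly increasing index order while producing the same result. For an adaptive code, the inspector would have to be rerun when the edge list changes, which is the overhead a cost model would weigh.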
Software Support For Improving Locality in Scientific Codes
- In Proc. CPC2000, 2000
Cited by 13 (1 self)
We propose to develop and evaluate software support for improving locality for advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative PDE solvers for 3D partial differential equations have poor locality because accesses to nearby elements in higher-level dimensions are spread far apart in memory. Careful tiling and padding can frequently recapture such reuse. Second, computations on adaptive meshes and sparse matrices experience many cache misses because they access data in an irregular manner. Data layout and access order can be rearranged according to mesh connections or geometric location to improve locality, with cost models used to guide frequency of transformations for adaptive computations. 1 Introduction From astrophysics to biochemistry, scientists are increasingly using computers to conduct research and development. Fuelin...
Database support for data-driven scientific applications in the grid
- Parallel Processing Letters, 2003
Cited by 13 (3 self)
In this paper we describe a services oriented software system to provide basic database support for efficient execution of applications that make use of scientific datasets in the Grid. This system supports two core operations: efficient selection of the data of interest from distributed databases and efficient transfer of data from storage nodes to compute nodes for processing. We present its overall architecture and main components and describe preliminary experimental results.
Efficient Compiler and Run-Time Support for Parallel Irregular Reductions
- Parallel Computing, 2000
Cited by 11 (0 self)
Many scientific applications are comprised of irregular reductions on large data sets. In shared-memory parallel programs, these irregular reductions are typically computed in parallel using replicated buffers, then combined using synchronization. We develop LocalWrite, a new technique which partitions irregular reductions so that each processor computes values only for locally assigned data, eliminating the need for buffers or synchronized writes. Computation is replicated if its results are needed on multiple processors. We experimentally evaluate its performance for three irregular codes on a software DSM running on a distributed-memory multiprocessor and two shared-memory multiprocessors while varying connectivity, locality, and adaptivity. Results show LocalWrite improves performance significantly compared to using replicated buffers, and can match or exceed explicit message-passing gather/scatter for applications with low locality or high adaptivity. Keywords: parallelizing comp...
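The contrast between replicated buffers and the local-write idea described in this abstract can be sketched as follows. This is a sequential simulation under stated assumptions, not the authors' implementation: processors are modeled as loop iterations, ownership is an assumed block partition, and each simulated processor scans every edge rather than a precomputed iteration list.

```python
def replicated_buffers(edges, weights, num_nodes, num_procs):
    """Each simulated processor accumulates into a private copy, then copies are combined."""
    buffers = [[0.0] * num_nodes for _ in range(num_procs)]
    for p in range(num_procs):
        for e in range(p, len(edges), num_procs):   # cyclic split of iterations
            u, v = edges[e]
            buffers[p][u] += weights[e]
            buffers[p][v] += weights[e]
    # Combine step: on a real machine this is where synchronization is needed.
    return [sum(buf[n] for buf in buffers) for n in range(num_nodes)]

def local_write(edges, weights, num_nodes, num_procs):
    """Each simulated processor updates only the nodes it owns."""
    def owner(n):
        return n * num_procs // num_nodes           # assumed block partition

    result = [0.0] * num_nodes
    for p in range(num_procs):
        for e, (u, v) in enumerate(edges):
            # An edge whose endpoints have different owners is processed by
            # both owners: the computation is replicated, the writes are not.
            if owner(u) == p:
                result[u] += weights[e]
            if owner(v) == p:
                result[v] += weights[e]
    return result
```

Both functions return the same reduction. The second needs no private buffers and no combine step, at the price of repeating work for edges that cross partition boundaries, which is why locality of the partition matters.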
Locality Optimizations For Adaptive Irregular Scientific Codes
- 2000
Cited by 10 (0 self)
Irregular scientific codes experience poor cache performance due to their memory access patterns. We examine several data and computation locality transformations including GPART, a new technique based on hierarchical clustering. GPART constructs quality partitions quickly by clustering multiple neighboring nodes in a few passes, with priority on nodes with high degree. Overhead is kept low by considering only edges between partitions. We develop compiler analyses and transformations in SUIF to automatically apply locality transformations, and propose user annotations to locate coordinate information needed by geometric partitioning algorithms. We experimentally evaluate locality optimizations for both static and adaptive codes, where connection patterns dynamically change at intervals during program execution. We derive a simple cost model to guide locality optimizations when access patterns change. Experiments on several irregular scientific codes show locality optimization t...
SmartApps: An Application Centric Approach to High Performance Computing
- In Proc. of the 13th Annual Workshop on Languages and Compilers for Parallel Computing (LCPC), Yorktown Heights, 2000
Cited by 6 (2 self)
State-of-the-art run-time systems are a poor match to diverse, dynamic distributed applications because they are designed to provide support to a wide variety of applications, without much customization to individual specific requirements. Little or no guiding information flows directly from the application to the run-time system to allow the latter to fully tailor its services to the application. As a result, the performance is disappointing. To address this problem, we propose application-centric computing, or SMART APPLICATIONS. In the executable of smart applications, the compiler embeds most run-time system services, and a performance-optimizing feedback loop that monitors the application’s performance and adaptively reconfigures the application and the OS/hardware platform. At run-time, after incorporating the code’s input and the system’s resources and state, the SmartApp performs a global optimization. This optimization is instance specific and thus much more tractable than a global generic optimization between application, OS and hardware. The resulting code and resource customization should lead to major speedups. In this paper, we first describe the overall architecture of SmartApps and then present the achievements to date: run-time optimizations, performance modeling, and moderately reconfigurable hardware. The paper concludes with a short description of current and future development work.
A Comparison of Parallelization Techniques for Irregular Reductions
- 15th IEEE Int'l. Parallel and Distributed Processing Symp. (IPDPS'2001), 2001
Cited by 6 (0 self)
A large class of scientific applications are comprised of irregular reductions on large data sets. On shared-memory multiprocessors these reductions are typically parallelized by computing partial results into replicated buffers, then combining the values into shared data using synchronization. Recently, a number of alternative techniques have been developed based on selective privatization, local writes, and synchronized writes. In this paper, we present a more efficient version of the local write algorithm which is 56% faster on average. We then experimentally compare the performance of each technique using a number of representative kernels. Results show speedups vary greatly depending on application characteristics such as connectivity, locality, and adaptivity. In general, we find the local write technique provides the best performance, particularly when applications display good locality. 1 Introduction As scientists attempt to model ever more complex problems, their programs b...

