• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

Memory-Hierarchy Management (1992)

by S Carr
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 37
Next 10 →

Improving Data Locality with Loop Transformations

by Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng - ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS , 1996
"... ..."
Abstract - Cited by 275 (25 self) - Add to MetaCart
Abstract not found

Unifying Data and Control Transformations for Distributed Shared-Memory Machines

by Michal Cierniak, Wei Li , 1994
"... We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimiz ..."
Abstract - Cited by 150 (10 self) - Add to MetaCart
We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimizations for distributed shared-memory machines, although the same techniques can be used for sequential machines with a memory hierarchy. Our compiler optimizations are based on an algebraic representation of data mappings and a new data locality model. We present a pure data transformation algorithm and an algorithm unifying data and control transformations. While there has been much work on control transformations, the opportunities for data transformations have been largely neglected. In fact, data transformations have the advantage of being applicable to programs that cannot be optimized with control transformations. The unified algorithm, which performs data and control transformations s...

Compiler Blockability of Numerical Algorithms

by Steve Carr, Ken Kennedy - IN PROCEEDINGS OF SUPERCOMPUTING '92 , 1992
"... Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Unfortunately, memory speeds have not kept pace. The result is an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more ..."
Abstract - Cited by 94 (4 self) - Add to MetaCart
Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Unfortunately, memory speeds have not kept pace. The result is an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs. This paper describes our investigation into compiler technology designed to obviate the need for machine-specific programming. Our results reveal that through the use of compiler optimizations many numerical algorithms can be expressed in a natural form while retaining good memory performance.

Improving the Ratio of Memory Operations to Floating-Point Operations in Loops

by Steve Carr, Ken Kennedy - ACM Transactions on Programming Languages and Systems , 1994
"... this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipe ..."
Abstract - Cited by 91 (16 self) - Add to MetaCart
this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they statically estimate the balance between memory operations and floating-point operations for each loop in a particular program and use these estimates to determine whether to apply various loop transformations. Experiments with our automatic techniques show that integer-factor speedups are possible on kernels. Additionally, the estimate of the balance between memory operations and computation, and the application of the estimate are very accurate---experiments reveal little difference between the balance achieved by our automatic system and that possible by hand optimization. Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors---Compilers ;

PASSION: Parallel And Scalable Software for Input-Output

by Alok Choudhary, Rajesh Bordawekar, Michael Harry, Rakesh Krishnaiyer, Ravi Ponnusamy, Tarvinder Singh, Rajeev Thakur , 1994
"... \We are developing a software system called PASSION: Parallel And Scalable Software for Input-Output which provides software support for high performance parallel I/O. PASSION provides support at the language, compiler, runtime as well as file system level. PASSION provides runtime procedures for pa ..."
Abstract - Cited by 72 (35 self) - Add to MetaCart
\We are developing a software system called PASSION: Parallel And Scalable Software for Input-Output which provides software support for high performance parallel I/O. PASSION provides support at the language, compiler, runtime as well as file system level. PASSION provides runtime procedures for parallel access to files (read/write), as well as for out-of-core computations. These routines can either be used together with a compiler to translate out-of-core data parallel programs written in a language like HPF, or used directly by application programmers. A number of optimizations such as Two-Phase Access, Data Sieving, Data Prefetching and Data Reuse have been incorporated in the PASSION Runtime Library for improved performance. PASSION also provides an initial framework for runtime support for out-of-core irregular problems. The goal of the PASSION compiler is to automatically translate out- of-core data parallel programs to node programs for distributed memory machines, with calls to the PASSION Runtime Library. At the language level, PASSION suggests extensions to HPF for out-of-core programs. At the file system level, PASSION provides support for buffering and prefetching data from disks. A portable parallel file system is also being developed as part of this project, which can be used across homogeneous or heterogeneous networks of workstations. PASSION also provides support for integrating data and task parallelism using parallel I/O techniques. We have used PASSION to implement a number of out-of-core applications such as a Laplace's equation solver, 2D FFT, Matrix Multiplication, LU Decomposition, image processing applications as well as unstructured mesh kernels in molecular dynamics and computational fluid dynamics. We are currently in the process of using PASSION in applications in CFD (3D turbulent flows), molecular structure calculations, seismic computations, and earth and space science applications such as Four-Dimensional Data Assimilation. PASSION is currently available on the Intel Paragon, Touchstone Delta and iPSC/860. Efforts are underway to port it to the IBM SP-1 and SP-2 using the Vesta Parallel File System.

Improving effective bandwidth through compiler enhancement of global cache reuse

by Chen Ding - In Proceedings of International Parallel and Distributed Processing Symposium , 2001
"... While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent o ..."
Abstract - Cited by 62 (17 self) - Add to MetaCart
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy

Register Promotion in C Programs

by Keith D. Cooper, John Lu - Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI-97 , 1997
"... The combination of pointers and pointer arithmetic in C makes the task of improving C programs somewhat more difficult than improving programs written in simpler languages like Fortran. While much work has been published that focuses on the analysis of pointers, little has appeared that uses the res ..."
Abstract - Cited by 46 (3 self) - Add to MetaCart
The combination of pointers and pointer arithmetic in C makes the task of improving C programs somewhat more difficult than improving programs written in simpler languages like Fortran. While much work has been published that focuses on the analysis of pointers, little has appeared that uses the results of such analysis to improve the code compiled for C. This paper examines the problem of register promotion in C and presents experimental results showing that it can have dramatic effects on memory traffic. 1 Introduction The presence of pointer-valued variables in C has long been recognized as an impediment to effective compiletime optimization. Pointers introduce a degree of uncertainty into the results of static analysis. Pointer assignments create multiple names for storage locations, with the result that the compiler must avoid reordering stores to memory. Pointer arithmetic introduces further ambiguity; understanding the results of *(p+8) requires a detailed knowledge, at compil...

Memory Hierarchy Management for Iterative Graph Structures

by Ibraheem Al-furaih , et al. , 1998
"... The increasing gap in processor and memory speeds has forced microprocessors to rely on deep cache hierarchies to keep the processors from starving for data. For many applications, this results in a wide disparity between sustained and peak achievable speed. Applications need to be tuned to processo ..."
Abstract - Cited by 31 (0 self) - Add to MetaCart
The increasing gap in processor and memory speeds has forced microprocessors to rely on deep cache hierarchies to keep the processors from starving for data. For many applications, this results in a wide disparity between sustained and peak achievable speed. Applications need to be tuned to processor and memory system architectures for cache locality, memory layout and data prefetch and reuse. In this paper we investigate optimizations for unstructured iterative applications in which the computational structure remains static or changes only slightly through iterations. Our methods reorganize the data elements to obtain better memory system performance without modifying code fragments. Our experimental results show that the overall time can be reduced significantly using our optimizations. Further, the overhead of our methods is small enough that they are applicable even if the computational structure does not substantially change for tens of iterations.

Combining Optimization for Cache and Instruction-Level Parallelism

by Steve Carr - in PACT '96 , 1996
"... Current architectural trends in instruction-level parallelism (ILP) have significantly increased the computational power of microprocessors. As a result, the demands on the memory system have increased dramatically. Not only do compilers need to be concerned with finding ILP to utilize machine resou ..."
Abstract - Cited by 26 (3 self) - Add to MetaCart
Current architectural trends in instruction-level parallelism (ILP) have significantly increased the computational power of microprocessors. As a result, the demands on the memory system have increased dramatically. Not only do compilers need to be concerned with finding ILP to utilize machine resources effectively, but they also need to be concerned with ensuring that the resulting code has a high degree of cache locality. Previous work has concentrated either on improving ILP in nested loops [3, 6, 7, 14, 16, 17] or on improving cache performance [9, 15, 18]. This paper presents a performance metric that can be used to guide the optimization of nested loops considering the combined effects of ILP, data reuse and latency hiding techniques. We have implemented the technique in a source-to-source transformation system called Memoria [5]. Preliminary experiments reveal that dramatic performance improvements for nested loops are obtainable (we regularly get at least a factor of 2 on kerne...

Improving Software Pipelining With Unroll-and-Jam

by Steve Carr, Chen Ding, Philip Sweany - In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Maui, HI , 1996
"... To take advantage of recent architectural improvements in microprocessors, advanced compiler optimizations such as software pipelining have been developed [1, 2, 3, 4]. Unfortunately, not all loops have enough parallelism in the innermost loop body to take advantage of all of the resources a machine ..."
Abstract - Cited by 25 (2 self) - Add to MetaCart
To take advantage of recent architectural improvements in microprocessors, advanced compiler optimizations such as software pipelining have been developed [1, 2, 3, 4]. Unfortunately, not all loops have enough parallelism in the innermost loop body to take advantage of all of the resources a machine provides. Unroll-and-jam is a transformation that can be used to increase the amount of parallelism in the innermost loop body by making better use of resources and limiting the effects of recurrences [5, 6]. In this paper, we demonstrate how unroll-and-jam can significantly improve the initiation interval in a software-pipelined loop. Improvements in the initiation interval of greater than 40% are common, while dramatic improvements of a factor of 5 are possible. 1 Introduction Over the past decade, the computer industry has realized dramatic improvements in the power of microprocessors. These gains have been achieved both by cycle-time improvements and by architectural innovations like m...
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University