Automatic and Interactive Parallelization (1992)

by K S McKinley

Results 1 - 10 of 26

Improving Data Locality with Loop Transformations

by Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng - ACM Transactions on Programming Languages and Systems, 1996
Cited by 275 (25 self)
Abstract not found

Compiler Optimizations for Improving Data Locality

by Steve Carr, Kathryn S. McKinley, Chau-Wen Tseng - In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 1994
Cited by 180 (17 self)
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the origi...
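The effect of loop permutation on spatial reuse can be illustrated with a small simulation. The sketch below is not the cost model from the paper; it assumes a column-major n x n array, a tiny fully associative LRU cache, and hypothetical parameters (`line_size`, `num_lines`) chosen only to make the difference between the two loop orders visible.

```python
from collections import OrderedDict


def count_misses(addresses, line_size=8, num_lines=4):
    """Count misses in a small fully associative LRU cache (illustrative model)."""
    cache = OrderedDict()  # keys are cache line numbers, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)  # hit: mark line as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def traversal(n, row_index_innermost):
    """Element addresses for an n x n column-major array, A(i, j) -> i + j * n."""
    if row_index_innermost:
        # j outer, i inner: consecutive accesses have stride 1
        return [i + j * n for j in range(n) for i in range(n)]
    # i outer, j inner: consecutive accesses have stride n
    return [i + j * n for i in range(n) for j in range(n)]


n = 32
misses_stride_n = count_misses(traversal(n, row_index_innermost=False))
misses_stride_1 = count_misses(traversal(n, row_index_innermost=True))
print("stride-n order misses:", misses_stride_n)
print("stride-1 order misses:", misses_stride_1)
```

Under these assumptions, the permuted (stride-1) order touches each cache line once, while the stride-n order evicts every line before it can be reused.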

Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution

by Ken Kennedy, Kathryn S. McKinley - In Languages and Compilers for Parallel Computing, 1994
Cited by 110 (10 self)
Loop fusion is a program transformation that merges multiple loops into one. It is effective for reducing the synchronization overhead of parallel loops and for improving data locality. This paper presents three results for fusion: (1) a new algorithm for fusing a collection of parallel and sequential loops, minimizing parallel loop synchronization while maximizing parallelism; (2) a proof that performing fusion to maximize data locality is NP-hard; and (3) two polynomial-time algorithms for improving data locality. These techniques also apply to loop distribution, which is shown to be essentially equivalent to loop fusion. Our approach is general enough to support other fusion heuristics. Preliminary experimental results validate our approach for improving performance by exploiting data locality and increasing the granularity of parallelism.
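As a minimal illustration of the transformation itself (not of the fusion algorithms in the paper), the sketch below fuses two loops by hand. The function names and the arithmetic are hypothetical. Fusion is legal here because the second statement reads `b[i]` in the same iteration that writes it, so no dependence is reversed.

```python
def unfused(a):
    """Two separate loops: b is fully written before c is computed."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(n):
        b[i] = a[i] * 2.0
    for i in range(n):
        c[i] = b[i] + 1.0
    return b, c


def fused(a):
    """One fused loop: b[i] is reused immediately after it is produced."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(n):
        b[i] = a[i] * 2.0
        c[i] = b[i] + 1.0  # reads the value written one statement earlier
    return b, c


data = [1.0, 2.0, 3.0]
print(unfused(data) == fused(data))
```

In the fused form, each `b[i]` is consumed while it is still likely to be in cache or in a register, and a parallel version would need one synchronization point instead of two.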

Optimizing for Parallelism and Data Locality

by Ken Kennedy, Kathryn S. McKinley - In Proceedings of the 1992 ACM International Conference on Supercomputing, 1992
Cited by 92 (13 self)
Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the tradeoffs between effectively utilizing parallelism and memory hierarchy on shared-memory multiprocessors. We present a simple, but surprisingly accurate, memory model to determine cache line reuse from both multiple accesses to the same memory location and from consecutive memory access. The model is used in memory optimizing and loop parallelization algorithms that effectively exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results.
1 Introduction
Transformations to exploit parallelism and to improve data locality are two of the most valuable compiler techniques in use today. Independently, each of these optimizations has been shown to result in dramatic improvements. This paper seeks to combine t...

Cache Interference Phenomena

by O. Temam, C. Fricker, W. Jalby - In Proceedings of the Sigmetrics Conference on Measurement and Modeling of Computer Systems, 1994
Cited by 78 (5 self)
The impact of cache interferences on program performance (particularly numerical codes, which heavily use the memory hierarchy) remains unknown. The general knowledge is that cache interferences are highly irregular, in terms of occurrence and intensity. In this paper, the different types of cache interferences that can occur in numerical loop nests are identified. An analytical method is developed for detecting the occurrence of interferences and, more important, for computing the number of cache misses due to interferences. Simulations and experiments on real machines show that the model is generally accurate and that most interference phenomena are captured. Experiments also show that cache interferences can be intense and frequent. Certain parameters such as array base addresses or dimensions can have a strong impact on the occurrence of interferences. Modifying these parameters only can induce global execution time variations of 30% and more. Applications of these modeling techniq...
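The sensitivity to array base addresses can be reproduced in a toy simulation. The sketch below is not the analytical method from the paper; it assumes a hypothetical direct-mapped cache (`line_size` and `num_sets` are made-up parameters) and a loop that alternates between two arrays, then compares two choices of base address for the second array.

```python
def direct_mapped_misses(addresses, line_size=4, num_sets=8):
    """Count misses in a direct-mapped cache: one line per set, no associativity."""
    resident = [None] * num_sets  # line number currently held by each set
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets
        if resident[index] != line:
            misses += 1
            resident[index] = line  # the new line displaces the old one
    return misses


def interleaved_trace(base_x, base_y, n):
    """Addresses touched by: for i in range(n): use x[i], then y[i]."""
    trace = []
    for i in range(n):
        trace.append(base_x + i)
        trace.append(base_y + i)
    return trace


n = 16
# Cache capacity is line_size * num_sets = 32 elements.
# Bases exactly one cache capacity apart: x[i] and y[i] map to the same set.
aligned = direct_mapped_misses(interleaved_trace(0, 32, n))
# Shifting y by one cache line (4 elements) moves it to the neighboring set.
shifted = direct_mapped_misses(interleaved_trace(0, 36, n))
print("aligned bases:", aligned, "misses")
print("shifted bases:", shifted, "misses")
```

With aligned bases every access evicts the line the next access needs, so all accesses miss; padding the second array by one line leaves only the compulsory misses.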

Reducing Conflicts In Direct-Mapped Caches With A Temporality-Based Design

by Jude Rivers, Edward S. Davidson - Proceedings of the 1996 ICPP, 1996
Cited by 68 (9 self)
Direct-mapped caches are often plagued by conflict misses because they lack the associativity to store more than one memory block in each set. However, some blocks that have no temporal locality actually cause program execution degradation by displacing blocks that do manifest temporal behavior. In this paper, we present a simple but efficient novel hardware design called the Non-Temporal Streaming (NTS) Cache that supplements the conventional direct-mapped cache with a parallel fully associative buffer. Every cache block loaded into the main cache is monitored for temporal behavior by a hardware detection unit. Cache blocks identified as nontemporal are allocated to the buffer on subsequent requests. Our simulations show that the NTS Cache not only provides a performance improvement over the conventional direct-mapped cache, but can also save on-chip area. For some numerical programs like FFTPDE, APPSP and APPBT from the NAS benchmark suite, an integral NTS Cache of size 9KB (i.e., 8KB...

Fusion of Loops for Parallelism and Locality

by Naraig Manjikian, Tarek S. Abdelrahman - IEEE Transactions on Parallel and Distributed Systems, 1995
Cited by 50 (3 self)
Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which reduce parallelism. In addition, performance losses result from cache conflicts in fused loops. We present new, systematic techniques which: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) allow parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We evaluate our techniques on a 56-processor KSR2 multiprocessor, and show improvements of up to 20% for representative loop nest sequences. The results also indicate a performance tradeoff as more processors are used, suggesting careful evaluation of the profitability of fusion.
1 Introduction
The performance of data-parallel applications on cache-coherent shared-memory multiprocessors is significantly affected by data locality and by the cost ...

Estimating Cache Misses and Locality Using Stack Distances

by Calin Cascaval, David A. Padua, 2003
Cited by 35 (0 self)
Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses, at compile time, using a machine-independent model based on stack algorithms. Our algorithm computes the stack histograms symbolically, using data dependence distance vectors, and is totally accurate when dependence distances are uniformly generated. The stack histogram accurately models fully associative caches with LRU replacement policy, and provides a very good approximation for set-associative caches and programs with non-constant dependence distances.
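The relationship between stack distances and LRU misses can be shown on a concrete access trace. Note that this sketch measures distances from an explicit trace at run time, whereas the paper describes computing the histogram symbolically at compile time; the function names and the example trace are hypothetical.

```python
from collections import Counter


def stack_distances(trace):
    """LRU stack distance of each access; None marks a first (cold) access.

    The distance is the number of distinct items referenced since the
    previous access to the same item.
    """
    stack = []  # least recently used at the front, most recent at the end
    distances = []
    for item in trace:
        if item in stack:
            pos = stack.index(item)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(item)  # item becomes the most recently used
    return distances


def lru_misses(distances, capacity):
    """Misses in a fully associative LRU cache holding `capacity` items."""
    return sum(1 for d in distances if d is None or d >= capacity)


trace = ["a", "b", "c", "a", "b", "d", "a"]
dists = stack_distances(trace)
histogram = Counter(dists)
print("distances:", dists)
print("histogram:", dict(histogram))
for capacity in (2, 3):
    print("capacity", capacity, "->", lru_misses(dists, capacity), "misses")
```

Because the histogram does not depend on the cache size, a single pass over the trace gives the miss count for every fully associative LRU capacity.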

Optimization within a Unified Transformation Framework

by Wayne Anthony Kelly, 1996
Cited by 29 (0 self)
Abstract not found

Improving the Performance of DSM Systems via Compiler Involvement

by Ravi Mirchandaney, Seema Hiranandani, Ajay Sethi - In Proceedings of Supercomputing '94, 1994
Cited by 26 (1 self)
Distributed shared memory (DSM) systems provide an illusion of shared memory on distributed memory systems such as workstation networks and some parallel computers such as the Cray T3D and Convex SPP-1. This illusion is provided either by enhancements to hardware, software, or a combination thereof. On these systems, users can write programs using a shared memory style of programming instead of message passing which is tedious and error prone. Our experience with one such system, TreadMarks, has shown that a large class of applications do not perform well on these systems. TreadMarks is a software distributed shared memory system designed by Rice University researchers to run on networks of workstations and massively parallel computers. Due to the distributed nature of the memory system, shared memory synchronization primitives such as locks and barriers often cause significant amounts of communication. We have provided a set of powerful primitives that will alleviate the problems with...