Estimating Cache Misses and Locality Using Stack Distances
, 2003
Cited by 35 (0 self)
Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses at compile time, using a machine-independent model based on stack algorithms. Our algorithm computes the stack histograms symbolically, using data dependence distance vectors, and is exact when dependence distances are uniformly generated. The stack histogram accurately models fully associative caches with an LRU replacement policy and provides a very good approximation for set-associative caches and for programs with non-constant dependence distances.
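To make the stack-distance idea concrete, here is a minimal Python sketch. It is not the symbolic compile-time method the abstract describes: it measures stack distances on an explicit address trace, which is an assumption made purely for illustration. The function names (`stack_distances`, `stack_histogram`, `lru_misses`) are hypothetical. The property it shows is the standard one for stack algorithms: in a fully associative LRU cache with `capacity` lines, an access hits exactly when its stack distance is smaller than `capacity`.

```python
from collections import Counter

def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = []       # least recently used first, most recently used last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Number of distinct other addresses touched since the last use.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)   # cold (compulsory) miss
        stack.append(addr)
    return distances

def stack_histogram(trace):
    """Count how many accesses fall at each stack distance."""
    return Counter(stack_distances(trace))

def lru_misses(trace, capacity):
    """Misses of a fully associative LRU cache holding `capacity` lines."""
    hist = stack_histogram(trace)
    return sum(n for d, n in hist.items() if d is None or d >= capacity)
```

One histogram answers the miss-count question for every cache size at once, which is why a single machine-independent model can serve several cache configurations.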
Efficient and precise array access analysis
- ACM Trans. Program. Lang. Syst
, 2000
Cited by 25 (1 self)
A number of existing compiler techniques hinge on the analysis of array accesses in a program. The most important task in array access analysis is to collect the information about array accesses of interest and summarize it in some standard form. Traditional forms used in array access analysis are sensitive to the complexity of array subscripts; that is, they are usually quite accurate and efficient for simple array subscripting expressions, but lose accuracy or require potentially expensive algorithms for complex subscripts. Our study has revealed that in many programs, particularly numerical applications, many access patterns are simple in nature even when the subscripting expressions are complex. Based on this analysis, we have developed a new, general array region representational form, called the linear memory access descriptor (LMAD). The key idea of the LMAD is to relate all memory accesses to the linear machine memory rather than to the shape of the logical data structures of a programming language. This form helps us expose the simplicity of the actual patterns of array accesses in memory, which may be hidden by complex array subscript expressions. Our recent experimental studies show that our new representation simplifies array
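The idea of describing accesses in terms of linear memory can be sketched in Python as follows. This is an illustration only, not the LMAD definition from the paper: the descriptor shape used here (a base offset plus a list of `(stride, count)` pairs) and the function name `linear_offsets` are assumptions.

```python
from itertools import product

def linear_offsets(base, dims):
    """Enumerate the linear memory offsets touched by a strided access.

    base: offset of the first element accessed
    dims: list of (stride, count) pairs, one per loop level (hypothetical form)
    """
    offsets = set()
    for idx in product(*(range(count) for _, count in dims)):
        offsets.add(base + sum(stride * i for (stride, _), i in zip(dims, idx)))
    return sorted(offsets)

# Hypothetical example: a column-major array with 10 rows, accessed as
# A(2*i + 1, j) for i in 0..2 and j in 0..1 (0-based indices).
# Linear offset = (2*i + 1) + 10*j, i.e. base 1 with strides 2 and 10.
pattern = linear_offsets(1, [(2, 3), (10, 2)])
```

Here the subscript `2*i + 1` looks more involved than the pattern it produces, which in linear memory is just two regular strides.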
Principles of Speculative Run-time Parallelization
, 1998
Cited by 19 (9 self)
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because these loops have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and applies a fully parallel data dependence test to check for any unsatisfied data dependences; if the test fails, then the loop is re-executed serially. We will present the principles of the design and implementation of a compiler that employs both run-time and static techniques to parallelize dynamic applications. Run-time optimizations always represent a tradeoff between a speculated potential benefit and a certain (sure) overhead that must be paid. We will introduce techniques that take advantage of classic compiler methods to reduce the cost of run-time optimization, thus tilting the outcome of speculation in favor of significant performance gains. Experimental results from the ...
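The speculate, test, and fall back cycle described in this abstract can be sketched in Python as follows. This is a heavily simplified illustration, not the authors' dependence test. It runs the iterations one after another rather than in parallel, but each iteration sees only the pre-loop state, so the result matches what a doall would compute. The name `speculative_doall` and the `body(i, read, write)` callback interface are assumptions, and any cross-iteration read-write or write-write overlap is conservatively treated as a failed speculation.

```python
def speculative_doall(body, data, n_iters):
    """Speculatively run body(i, read, write) as if all iterations were independent.

    Returns True if speculation succeeded, False if the loop was re-executed serially.
    """
    readers = {}    # index -> iterations that read the pre-loop value
    writers = {}    # index -> iterations that wrote the location
    buffered = []   # per-iteration private write buffers

    for i in range(n_iters):
        local = {}

        def read(k, i=i, local=local):
            if k in local:               # value produced by this same iteration
                return local[k]
            readers.setdefault(k, set()).add(i)
            return data[k]               # shared data is untouched during speculation

        def write(k, v, i=i, local=local):
            writers.setdefault(k, set()).add(i)
            local[k] = v

        body(i, read, write)
        buffered.append(local)

    # Test: each written location has one writer and no reader from another iteration.
    ok = all(len(ws) == 1 and readers.get(k, set()) <= ws
             for k, ws in writers.items())
    if ok:
        for local in buffered:           # commit the buffered writes
            for k, v in local.items():
                data[k] = v
        return True

    # Speculation failed: discard the buffers and re-execute the loop serially.
    for i in range(n_iters):
        body(i, lambda k: data[k], lambda k, v: data.__setitem__(k, v))
    return False
```

Buffering the writes keeps the shared data unchanged until the test passes, so a failed speculation needs no rollback. The price is the logging and checking overhead that the abstract calls the certain cost of run-time optimization.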
HUNTing the overlap
- In: PACT ’05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05)
, 2005
Cited by 13 (1 self)
Hiding communication latency is an important optimization for parallel programs. Programmers or compilers achieve this by using non-blocking communication primitives and overlapping communication with computation or other communication operations. Using non-blocking communication raises two issues: performance and programmability. In terms of performance, optimizers need to find a good communication schedule and are sometimes constrained by lack of full application knowledge. In terms of programmability, efficiently managing non-blocking communication can prove cumbersome for complex applications. In this paper we present the design principles of HUNT, a runtime system designed to search for and exploit some of the overlap available at execution time in UPC programs. Using virtual memory support, our runtime implements demand-driven synchronization for data involved in communication operations. It also employs message decomposition and scheduling heuristics to transparently improve the non-blocking behavior of applications. We provide a user-level implementation of HUNT on a variety of modern high-performance computing systems. Results indicate that our approach is successful in finding some of the overlap available at execution time. While system and application characteristics influence performance, perhaps the determining factor is the time taken by the CPU to execute a signal handler. Demand-driven synchronization at execution time eliminates the need for the explicit management of non-blocking communication. Besides increasing programmer productivity, this feature also simplifies compiler analysis for communication optimizations.
An Automatic Iteration/Data Distribution Method Based on Access Descriptors for DSMM
- In Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing (LCPC'99)
, 1999
Cited by 6 (2 self)
Nowadays NUMA architectures are widely accepted. For such multiprocessors, exploiting data locality is clearly a key issue. In this work, we present a method for automatically selecting the iteration/data distributions for a sequential F77 code while minimizing the parallel execution overhead (communications and load imbalance). We formulate an integer programming problem to achieve that minimum parallel overhead. The constraints of the integer programming problem are derived directly from a graph known as the Locality-Communication Graph (LCG), which captures the memory locality, as well as the communication patterns, of a parallel program. The constraints derived from the LCG express the data locality requirements for each array and the affinity relations between different array references. In addition, our approach uses the LCG to automatically schedule the communication operations required during the program execution, once the iteration/data distributions have been selected. Th...
Implementation Issues of Loop-level Speculative Run-time Parallelization
- In Proc. of the 8th Int. Conference on Compiler Construction (CC'99)
, 1999
Cited by 5 (3 self)
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because these loops have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and applies a fully parallel data dependence test to check for any unsatisfied data dependences; if the test fails, then the loop is re-executed serially. We will present the principles of the design and implementation of a compiler that employs both run-time and static techniques to parallelize dynamic applications. Run-time optimizations always represent a tradeoff between a speculated potential benefit and a certain (sure) overhead that must be paid. We will introduce techniques that take advantage of classic compiler methods to reduce the cost of run-time optimization, thus tilting the outcome of speculation in favor of significant performance gains. Experimental results from the ...
Access Descriptor based Locality Analysis for Distributed-Shared Memory Multiprocessors
- Proceedings of International Conference on Parallel Processing
, 1999
Cited by 3 (2 self)
Most of today's multiprocessors have a Distributed-Shared Memory (DSM) organization, which enables scalability while retaining the convenience of the shared-memory programming paradigm. Data locality is crucial for performance in DSM machines, due to the difference in access times between local and remote memories. In this paper, we present a compile-time representation that captures the memory locality exhibited by a program in the form of a graph known as the Locality-Communication Graph (LCG). In the LCG, each node represents a DO loop nest which can have at most one level of parallelism. Not all loops need to be represented within a node and, therefore, the LCG may contain cycles. Our representation works whether or not the loops represented by the nodes are perfectly nested, and the subscript expressions and loop limits can be affine or non-affine expressions of the loop indices. The LCG provides essential information that a parallelizing compiler can use to automatically choose a good iteration/data distribution and to schedule the communication operations required during program execution.
Sparse Constant Propagation via Memory Classification Analysis
, 1999
Cited by 2 (0 self)
This article presents a novel Sparse Constant Propagation technique which provides a heretofore unknown level of practicality. Unlike other techniques, which are based on data flow, it is based on the execution-order summarization sweep employed in Memory Classification Analysis (MCA), a technique originally developed for array dependence analysis. This methodology achieves a precise description of memory reference activity within a summary representation that grows only linearly with program size. Because of this, the collected sparse constant information need not be artificially limited to satisfy classical data flow lattice requirements, which constrain other algorithms to discard information in the interests of efficient termination. Sparse Constant Propagation is not only more effective within the MCA framework, but it in fact generalizes the framework. Original MCA provides the means to break only simple induction and reduction types of flow-dependences. The integrated fra...
A Comparative Analysis of Dependence Testing Mechanisms
- In Proceedings of LCPC2000, 13th International Workshop on Languages and Compilers for Parallel Computing
, 2000
Cited by 2 (1 self)
The internal mechanism used for a dependence test constrains its accuracy and determines its speed. The internal mechanism used for our Access Region Test (ART) is fundamentally different from that used in any other dependence test, and therefore its constraints and characteristics are likewise different. In this paper, we briefly describe our novel descriptor for representing memory accesses, the Access Region Descriptor (ARD) and the ART. We then describe the ARD intersection algorithm in some detail. Finally, we compare and contrast the mechanisms of the ARD intersection algorithm with the internal mechanisms for a number of popular prior dependence tests.
Parallelization of Benchmarks for Scalable Shared-Memory Multiprocessors
- In IEEE PACT
, 1998
Cited by 2 (0 self)
This work identifies practical compiling techniques for scalable shared memory machines. For this, we have focused on experimental studies using a real machine and representative codes. In the experiments, we transformed conventional codes to shared memory codes using several existing techniques and ran the output on the target machine to evaluate those techniques and to identify where improvement is needed. Based on the analysis of the results, we developed a few new techniques, and experimented again on the target machine to measure the effectiveness of each one. The results reported in this paper were quite positive, lending useful information to future research on compiler optimizations for existing SSM machines.

