Results 1 - 10 of 42
Efficient Context-Sensitive Pointer Analysis for C Programs
, 1995
Cited by 375 (9 self)
This paper proposes an efficient technique for context-sensitive pointer analysis that is applicable to real C programs. For efficiency, we summarize the effects of procedures using partial transfer functions. A partial transfer function (PTF) describes the behavior of a procedure assuming that certain alias relationships hold when it is called. We can reuse a PTF in many calling contexts as long as the aliases among the inputs to the procedure are the same. Our empirical results demonstrate that this technique is successful: a single PTF per procedure is usually sufficient to obtain completely context-sensitive results. Because many C programs use features such as type casts and pointer arithmetic to circumvent the high-level type system, our algorithm is based on a low-level representation of memory locations that safely handles all the features of C. We have implemented our algorithm in the SUIF compiler system and we show that it runs efficiently for a set of C benchmarks.
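The reuse condition in this abstract (one PTF serves every call whose inputs have the same alias relationships) can be illustrated with a toy memo table. This is only a sketch under simplifying assumptions, not the paper's algorithm: `alias_pattern`, `PTFCache`, and the `analyze` callback are hypothetical names, and aliasing is reduced to "two arguments point to the same named location".

```python
def alias_pattern(arg_targets):
    """Canonical alias pattern of a call: arguments that point to the same
    location receive the same small integer, in order of first appearance."""
    seen = {}
    return tuple(seen.setdefault(target, len(seen)) for target in arg_targets)


class PTFCache:
    """Toy table of procedure summaries keyed by (procedure, alias pattern)."""

    def __init__(self, analyze):
        # analyze is a hypothetical callback: (proc, pattern) -> summary
        self.analyze = analyze
        self.table = {}
        self.misses = 0  # number of times a new summary had to be built

    def summary(self, proc, arg_targets):
        key = (proc, alias_pattern(arg_targets))
        if key not in self.table:
            self.misses += 1
            self.table[key] = self.analyze(proc, key[1])
        return self.table[key]


cache = PTFCache(lambda proc, pattern: (proc, pattern))
cache.summary("swap", ["x", "y"])  # builds a summary for pattern (0, 1)
cache.summary("swap", ["a", "b"])  # same pattern, summary reused
cache.summary("swap", ["x", "x"])  # aliased inputs, new summary needed
```

In this toy model the first two calls share one summary because their arguments are unaliased in the same way; only the third call, whose two arguments alias each other, triggers a second analysis.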
Data Transformations for Eliminating Conflict Misses
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
Cited by 118 (12 self)
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occurring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PadLite only uses array and column dimension sizes, relying on assumptions about common array reference patterns. Pad analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PadLite can eliminate conflicts for benchmarks, but Pad is more effective over a range of cache and problem sizes. Padding reduces c...
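The ideas of a conflict distance and a gcd-based column test can be sketched for a direct-mapped cache. This is a simplified illustration, not the Pad or PadLite heuristics themselves: the function names are hypothetical, addresses and sizes are in bytes, and the cache is assumed to hold at least two lines.

```python
from math import gcd


def conflict_distance(addr_a, addr_b, cache_size):
    """Circular distance between the cache locations of two addresses
    in a direct-mapped cache of cache_size bytes."""
    d = (addr_a - addr_b) % cache_size
    return min(d, cache_size - d)


def pad_base(addr_a, addr_b, cache_size, line_size):
    """Smallest pad (a multiple of line_size) to add to addr_b so that the
    two addresses are at least one cache line apart in the cache.
    Assumes cache_size >= 2 * line_size, so such a pad exists."""
    pad = 0
    while conflict_distance(addr_a, addr_b + pad, cache_size) < line_size:
        pad += line_size
    return pad


def columns_until_conflict(column_bytes, cache_size):
    """Smallest k > 0 such that columns j and j + k of a column-major array
    map to the same cache location: k * column_bytes is a multiple of
    cache_size exactly when k is a multiple of cache_size / gcd."""
    return cache_size // gcd(column_bytes, cache_size)


CACHE, LINE = 16384, 32
pad = pad_base(0, 2 * CACHE, CACHE, LINE)       # bases collide, one line of padding separates them
near = columns_until_conflict(8192, CACHE)      # every 2nd column collides
far = columns_until_conflict(8200, CACHE)       # padded column size spreads columns out
```

With these assumed sizes, padding each column from 8192 to 8200 bytes moves the nearest colliding column from 2 columns away to 2048 columns away, which is the kind of effect intra-variable padding aims for.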
Cache Miss Equations: An Analytical Representation of Cache Misses
- In Proceedings of the 1997 ACM International Conference on Supercomputing
, 1997
Cited by 99 (4 self)
With the widening performance gap between processors and main memory, efficient memory accessing behavior is necessary for good program performance. Both hand-tuning and compiler optimization techniques are often used to transform codes to improve memory performance. Effective transformations require detailed knowledge about the frequency and causes of cache misses in the code.
Compiler Optimizations for Eliminating Barrier Synchronization
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1995
Cited by 75 (13 self)
This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the single-program, multiple-data (SPMD) model. By exploiting compile-time computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations have been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced 29% on average and by several orders of magnitude for certain programs.
Precise Miss Analysis for Program Transformations with Caches of Arbitrary Associativity
- In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
Cited by 74 (1 self)
Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler- or programmer-applied optimizations like data structure padding, matrix blocking, and other program transformations. Compiler optimization can be effective, but the lack of precise analysis and optimization frameworks makes it impossible to confidently make optimal, rather than heuristic-based, program transformations. Imprecision is most problematic in situations where hard-to-predict cache conflicts foil heuristic approaches. Furthermore, the lack of a general framework for compiler memory performance analysis makes it impossible to understand the combined effects of several program transformations. The Cache Miss Equation (CME) framework discussed in this paper addresses these issues. We express memory reference and cache conflict behavior in terms of sets of equations. The ...
Instruction Generation for Hybrid Reconfigurable Systems
- ACM Transactions on Design Automation of Electronic Systems
, 2001
Cited by 53 (5 self)
Building Blocks (ABBs), or instructions available from a given hardware library. The customized data path generated from many ABBs was referred to as an application specific unit (ASU). Cathedral's synthesis targeted ASUs, which could be executed in very few clock cycles. This goal was achieved via manual clustering of necessary operations into more compact operations, essentially a form of template construction. Whereas our template generation and matching algorithms are automated, the definition of clusters in Cathedral was a manual operation, mainly clustering loop and function bodies. Their results demonstrated an expected reduction of critical path length as well as interconnect as a result of clustering.
Quality and Speed in Linear-Scan Register Allocation
- In SIGPLAN Conference on Programming Language Design and Implementation
, 1998
Cited by 51 (2 self)
ing control flow as a linear ordering of the basic blocks makes linear-scan allocators run efficiently. If the allocation decisions in each basic block were independent from the decisions in other blocks, then the order in which we processed the blocks would be immaterial. But in fact some information about the register state and consistency is carried beyond basic block boundaries. This section enumerates the possible edges and transitions in the linear ordering and their effect on this information. The simplest edge (u,v) followed in the linear ordering occurs when u has no other successors and v has no other predecessors. The edge (2,3) in Figure 5 is an example. Since this edge is the only possible transition out of u and into v, any state existing at the bottom of u must also hold at the top of v. This kind of edge is relatively rare since the compiler usually collapses the two blocks into a single one. The next kind of edge occurs when u has multiple successors, but v has on...
The Zephyr abstract syntax description language
- In Proceedings of the Conference on Domain-Specific Languages
, 1997
Cited by 32 (1 self)
The following paper was originally published in the
Eliminating Conflict Misses for High Performance Architectures
- In Proceedings of the 1998 ACM International Conference on Supercomputing
, 1998
Cited by 29 (5 self)
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. Two data-layout transformations, inter- and intra-variable padding, can eliminate many conflict misses at compile time. We present GroupPad, an inter-variable padding heuristic to preserve group reuse in stencil computations frequently found in scientific computations. We show padding can also improve performance in parallel programs. Our optimizations have been implemented and tested on a collection of kernels and programs for different cache and data sizes. Preliminary results demonstrate GroupPad is able to consistently preserve group reuse among the programs evaluated, though execution time improvements are small for actual problem and cache sizes tested. Padding improves performance of parallel versions of programs approximately the same magnitude as sequential versions of the same program.
Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation
- Languages and Compilers for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, Revised Papers, volume 2958 of LNCS
, 2003
Cited by 27 (2 self)
Cetus is a compiler infrastructure for the source-to-source transformation of programs. We created Cetus out of the need for a compiler research environment that facilitates the development of interprocedural analysis and parallelization techniques for C, C++, and Java programs. We will describe our rationale for creating a new compiler infrastructure and give an overview of the Cetus architecture. The design is intended to be extensible for multiple languages and will become more flexible as we incorporate feedback from any difficulties we encounter introducing other languages. We will characterize Cetus' runtime behavior of parsing and IR generation in terms of execution time, memory usage, and parallel speedup of parsing, as well as motivate its usefulness through examples of projects that use Cetus. We will then compare these results with those of the Polaris Fortran translator.

