Results 1–10 of 13
Register allocation for programs in SSA form
In Compiler Construction 2006, volume 3923 of LNCS, 2006. Cited by 42 (5 self).
Abstract: In this technical report, we present an architecture for register allocation on the SSA form. We show how the properties of SSA-form programs and their interference graphs can be exploited to develop new methods for spilling, coloring, and coalescing. We present heuristic and optimal solution methods for these three subtasks.
An experimental comparison of cache-oblivious and cache-conscious programs
In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, 93–104, 2007. Cited by 19 (1 self).
Abstract: Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates subproblems of smaller size, and when the working set of a subproblem fits in some level of the memory hierarchy, the computations in that subproblem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy. An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question. This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.
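The divide-and-conquer structure described above can be sketched concretely. The following is a minimal, illustrative cache-oblivious matrix multiplication (not taken from the paper): it recursively halves the largest dimension, so some recursion level always produces a subproblem whose working set fits in each cache level; a small fixed base case stands in for that cutoff.

```python
def matmul_co(A, B, C, i0, i1, j0, j1, k0, k1, base=2):
    """Accumulate A[i0:i1, k0:k1] @ B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if max(di, dj, dk) <= base:
        # Base case: the working set is assumed to fit in cache here.
        for i in range(i0, i1):
            for j in range(j0, j1):
                for k in range(k0, k1):
                    C[i][j] += A[i][k] * B[k][j]
    elif di >= dj and di >= dk:          # split the i (row) dimension
        m = (i0 + i1) // 2
        matmul_co(A, B, C, i0, m, j0, j1, k0, k1, base)
        matmul_co(A, B, C, m, i1, j0, j1, k0, k1, base)
    elif dj >= dk:                        # split the j (column) dimension
        m = (j0 + j1) // 2
        matmul_co(A, B, C, i0, i1, j0, m, k0, k1, base)
        matmul_co(A, B, C, i0, i1, m, j1, k0, k1, base)
    else:                                 # split the shared k dimension
        m = (k0 + k1) // 2
        matmul_co(A, B, C, i0, i1, j0, j1, k0, m, base)
        matmul_co(A, B, C, i0, i1, j0, j1, m, k1, base)
```

No blocking parameter is tuned to any particular cache size; that is exactly the "obliviousness" whose performance cost the paper measures.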
Efficient Utilization of SIMD Extensions
IEEE Proceedings, Special Issue on Program Generation, Optimization, and Platform Adaptation, 2003. Cited by 13 (6 self).
Abstract: This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel’s SSE family, AMD’s 3DNow!, Motorola’s AltiVec, and IBM’s BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine’s general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers: the particular code structure hampers automatic vectorization and thus inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes (i) symbolic vectorization of DSP transforms, (ii) straight-line code vectorization for numerical kernels, and (iii) compiler backends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems, as well as roughly matching the performance of hand-tuned vendor libraries.
Register spilling and live-range splitting for SSA-form programs
In Proceedings of the International Conference on Compiler Construction, Lecture Notes in Computer Science, 2009. Cited by 11 (2 self).
Abstract: Register allocation decides which parts of a variable’s live range are held in registers and which in memory. The compiler inserts spill code to move the values of variables between registers and memory. Since fetching data from memory is much slower than reading directly from a register, careful spill code insertion is critical for the performance of the compiled program. In this paper, we present a spilling algorithm for programs in SSA form. Our algorithm generalizes the well-known furthest-first algorithm, which is known to work well on straight-line code, to control-flow graphs. We evaluate our technique by counting the executed spilling instructions in the CINT2000 benchmark on an x86 machine. The number of executed load (store) instructions was reduced by 54.5% (61.5%) compared to a state-of-the-art linear scan allocator, and by 58.2% (41.9%) compared to a standard graph-coloring allocator. The runtime of our algorithm is competitive with standard linear-scan allocators.
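On straight-line code, the furthest-first heuristic the paper generalizes is Belady's MIN rule: when a value must be brought into a full register set, evict the resident value whose next use lies furthest in the future. A minimal sketch under illustrative assumptions (values named by strings, one use per position, loads counted but stores ignored), not the paper's algorithm:

```python
def count_loads(uses, k):
    """uses: variable names in evaluation order; k: number of registers.
    Returns how many loads the furthest-first policy performs."""
    in_regs = set()
    loads = 0
    for pos, var in enumerate(uses):
        if var in in_regs:
            continue                      # value already resident: no load
        loads += 1
        if len(in_regs) == k:
            # Evict the resident value whose next use is furthest away
            # (a value never used again is the best possible victim).
            def next_use(v):
                for later, u in enumerate(uses[pos + 1:]):
                    if u == v:
                        return later
                return float("inf")
            in_regs.remove(max(in_regs, key=next_use))
        in_regs.add(var)
    return loads
```

The paper's contribution is extending this policy, which is optimal for straight-line code, to whole control-flow graphs, where "next use" is no longer a single well-defined distance.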
Max-coloring and online coloring with bandwidths on interval graphs, 2008. Cited by 6 (0 self).
Abstract: Given a graph G = (V, E) and positive integral vertex weights w: V → N, the max-coloring problem seeks to find a proper vertex coloring of G whose color classes C_1, C_2, ..., C_k minimize Σ_{i=1}^{k} max_{v ∈ C_i} w(v). This problem, restricted to interval graphs, arises whenever there is a need to design dedicated memory managers that provide better performance than the general-purpose memory management of the operating system. Though this problem seems similar to the dynamic storage allocation problem, there are fundamental differences. We make a connection between max-coloring and online graph coloring and use this to devise a simple 2-approximation algorithm for max-coloring on interval graphs. We also show that a simple first-fit strategy, which is a natural choice for this problem, yields an 8-approximation algorithm. We show this result by proving that the first-fit algorithm for online coloring an interval graph G uses no more than 8·χ(G) colors, significantly improving the bound of 26·χ(G) by Kierstead and Qin (Discrete Math., 144, 1995). We also show that the max-coloring problem is NP-hard. The problem of online coloring of intervals with bandwidths is a simultaneous generalization of online interval coloring and online bin packing. The input is a set I of intervals, each interval i ∈ I having an associated bandwidth b(i) ∈ (0, 1]. We seek an online algorithm that produces a coloring of the intervals such that for any color c and any real r, the sum of the bandwidths of intervals containing r and colored c is at most 1. Motivated by resource allocation problems, Adamy and Erlebach (Proceedings of the First International Workshop on Online and Approximation Algorithms, 2003, LNCS 2909, pp. 1–12) consider this problem and present an algorithm that uses at most 195 times the number of colors used by an optimal offline algorithm. Using the new analysis of first-fit coloring of interval graphs, we show that the Adamy-Erlebach algorithm is 35-competitive. Finally, we generalize the Adamy-Erlebach algorithm to a class of algorithms and show that a different instance from this class is 30-competitive.
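The first-fit strategy analyzed above is simple to state in code: give each interval the smallest color not already used by an overlapping interval. The sketch below processes intervals by start point for illustration; the paper's 8·χ(G) bound is for the online setting, where intervals arrive in an arbitrary order.

```python
def first_fit_coloring(intervals):
    """intervals: list of (start, end) half-open pairs.
    Returns a list of colors, one per interval, forming a proper coloring."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    colors = [None] * len(intervals)
    for i in order:
        s, e = intervals[i]
        # Colors already taken by previously colored overlapping intervals.
        taken = {colors[j] for j in order
                 if colors[j] is not None
                 and intervals[j][0] < e and s < intervals[j][1]}
        c = 0
        while c in taken:                 # first fit: smallest free color
            c += 1
        colors[i] = c
    return colors
```

For interval graphs, χ(G) equals the maximum number of pairwise-overlapping intervals, so first-fit's color count can be checked directly against that clique size.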
FFT compiler techniques
In Compiler Construction: 13th International Conference, CC 2004, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2004, 2004. Cited by 2 (1 self).
Abstract: (www.math.tuwien.ac.at/ascot) This paper presents compiler technology that targets general-purpose microprocessors augmented with SIMD execution units for exploiting data-level parallelism. Numerical applications are accelerated by automatically vectorizing blocks of straight-line code to be run on processors featuring two-way short vector SIMD extensions like Intel’s SSE 2 on Pentium 4, SSE 3 on Intel Prescott, AMD’s 3DNow!, and IBM’s SIMD operations implemented on the new processors of the BlueGene/L supercomputer. The paper introduces a special compiler backend for Intel P4’s SSE 2 and AMD’s 3DNow! which is able (i) to exploit particular properties of FFT code, (ii) to generate optimized address computation, and (iii) to perform specialized register allocation and instruction scheduling. Experiments show that the automatic SIMD vectorization techniques of this paper achieve the performance of hand-optimized code for key benchmarks. The newly developed methods have been integrated into the codelet generator of FFTW and have successfully vectorized complicated code like real-to-half-complex non-power-of-two FFT kernels. The floating-point performance of FFTW’s scalar version has been more than doubled, resulting in the fastest FFT implementation to date.
Decoupled (SSA-based) Register Allocators: From Theory to Practice, Coping with Just-in-Time Compilation and Embedded Processor Constraints, 2013. Cited by 1 (0 self).
Abstract: In compilation, register allocation is the optimization that chooses which variables of the source program, in unlimited number, are mapped to the actual registers, in limited number. Parts of the live ranges of the variables that cannot be mapped to registers are placed in memory; this eviction is called spilling. Until recently, compilers mainly addressed register allocation via graph coloring, using an idea developed by Chaitin et al. [33] in 1981. This approach addresses the spilling and the mapping of the variables to registers in one phase. In 2001, Appel and George [3] proposed to split register allocation into two separate phases. This idea yields better and independent solutions for both problems, but requires a very aggressive form of live-range splitting, split-everywhere, which renames all variables between all instructions of the program. However, in 2005, several groups [27, 84, 56, 16] observed that the static single assignment (SSA) form provides sufficient split points to decouple register allocation as Appel and George suggested, unless register aliasing or precoloring constraints are involved.
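The property that makes the SSA-based decoupling work is that SSA interference graphs are chordal: coloring greedily along a perfect elimination order (for SSA, an order derived from the dominance tree) uses exactly as many colors as the largest clique, i.e. the maximal register pressure after spilling. A minimal sketch of that greedy scan, on an illustrative hand-built graph rather than a real interference graph:

```python
def greedy_color(order, adj):
    """order: vertices in coloring order (a perfect elimination order for
    chordal graphs); adj: dict mapping each vertex to its neighbors.
    Returns a dict vertex -> color (smallest free nonnegative integer)."""
    color = {}
    for v in order:
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color
```

On a chordal graph this uses χ(G) colors; on general interference graphs (the Chaitin setting) no ordering gives such a guarantee, which is why coloring and spilling had to be solved together before the SSA-based decoupling.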
Automatically Optimized FFT Codes for the BlueGene/L Supercomputer, submitted to
 6th International Meeting on High Performance Computing for Computational Science (VecPar ’04
"... Abstract. IBM’s upcoming 360 Tflop/s supercomputer BlueGene/L featuring 65,536 processors is supposed to lead the Top 500 list when being installed in 2005. This paper presents one of the first numerical codes actually run on a small prototype of this machine. Formal vectorization techniques, the Vi ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
Abstract. IBM’s upcoming 360 Tflop/s supercomputer BlueGene/L featuring 65,536 processors is supposed to lead the Top 500 list when being installed in 2005. This paper presents one of the first numerical codes actually run on a small prototype of this machine. Formal vectorization techniques, the Vienna MAP vectorizer (both developed for generic short vector SIMD extensions), and the automatic performance tuning approach provided by Spiral are combined to generate automatically optimized FFT codes for the BlueGene/L machine targeting its twoway short vector SIMD “double ” floatingpoint unit. The resulting FFT codes are 40 % faster than the best scalar Spiral generated code and 5 times faster than the mixedradix FFT implementation provided by the Gnu scientific library GSL. 1