Optimizing compiler for the cell processor
- In PACT, 2005
Cited by 31 (1 self)
Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double-precision floating-point values up to 16 bytes per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality code over a wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
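The abstract lists "generation of scalar codes on SIMD units" without giving details. As a rough illustration only, the sketch below models one commonly described scheme for a SIMD-only unit: load the aligned 16-byte quadword that contains the scalar, then rotate it so the scalar lands in a fixed "preferred" position. The 16-byte register width, the big-endian layout, the preferred position at bytes 0-3, and all function names are assumptions made for this example, not the paper's implementation.

```python
QUADWORD = 16  # assumed SIMD register width in bytes


def load_quadword(memory: bytes, addr: int) -> bytes:
    """Load the aligned 16-byte quadword containing addr (low 4 address bits ignored)."""
    base = addr & ~(QUADWORD - 1)
    return memory[base:base + QUADWORD]


def rotate_left_bytes(quad: bytes, count: int) -> bytes:
    """Rotate a quadword left by count bytes."""
    count %= QUADWORD
    return quad[count:] + quad[:count]


def load_scalar_word(memory: bytes, addr: int) -> int:
    """Load a 4-byte scalar using only quadword load and rotate operations."""
    quad = load_quadword(memory, addr)
    # Move the requested word into the assumed preferred position (bytes 0-3).
    quad = rotate_left_bytes(quad, addr & (QUADWORD - 1))
    return int.from_bytes(quad[:4], "big")


# Hypothetical memory image: eight 4-byte words holding 100..107.
memory = b"".join(value.to_bytes(4, "big") for value in range(100, 108))
print(load_scalar_word(memory, 20))
```

The point of the sketch is that a scalar access costs a load plus a rotate, which is why a compiler for such a unit benefits from knowing alignment at compile time.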
Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture
- 2006
Cited by 26 (1 self)
... In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.
Polyhedral-model guided loop-nest auto-vectorization
- in PACT ’09: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Cited by 6 (5 self)
Optimizing compilers apply numerous interdependent optimizations, leading to the notoriously difficult phase-ordering problem: that of deciding which transformations to apply and in which order. Fortunately, new infrastructures such as the polyhedral compilation framework host a variety of transformations, facilitating the efficient exploration and configuration of multiple transformation sequences. Many powerful optimizations, however, remain external to the polyhedral framework, including vectorization. The low-level, target-specific aspects of vectorization for fine-grain SIMD have so far excluded it from being part of the polyhedral framework. In this paper we examine the interactions between loop transformations of the polyhedral framework and subsequent vectorization. We model the performance impact of the different loop transformations and vectorization strategies, and then show how this cost model can be integrated seamlessly into the polyhedral representation. This predictive modelling facilitates efficient exploration and educated decision making to best apply various polyhedral loop transformations while considering the subsequent effects of different vectorization schemes. Our work demonstrates the feasibility and benefit of tuning the polyhedral model in the context of vectorization. Experimental results confirm that our model has accurate predictions, providing speedups of over 2.0x on average over traditional innermost-loop vectorization on PowerPC970 and Cell-SPU SIMD platforms.
Compiling for Vector-Thread Architectures
Cited by 4 (1 self)
Vector-thread (VT) architectures exploit multiple forms of parallelism simultaneously. This paper describes a compiler for the Scale VT architecture, which takes advantage of the VT features. We focus on compiling loops, and show how the compiler can transform code that poses difficulties for traditional vector or VLIW processors, such as loops with internal control flow or cross-iteration dependences, while still taking advantage of features not supported by multithreaded designs, such as vector memory instructions. We evaluate the compiler using several embedded benchmarks and show that we can obtain substantial speedups over a single-issue, in-order scalar machine.
Exploiting vector parallelism in software pipelined loops
- In Proc. of the 38th Annual International Symposium on Microarchitecture, 2005
Cited by 4 (0 self)
An emerging trend in processor design is the incorporation of short vector instructions into the ISA. In fact, vector extensions have appeared in most general-purpose microprocessors. To utilize these instructions, traditional vectorization technology can be used to identify and exploit data parallelism. In contrast, efficient use of a processor’s scalar resources is typically achieved through ILP techniques such as software pipelining. In order to attain the best performance, it is necessary to utilize both sets of resources. This paper presents a novel approach for exploiting vector parallelism in a software pipelined loop. At its core is a method for judiciously partitioning operations between vector and scalar resources. The proposed algorithm (i) lowers the burden on the scalar resources by offloading computation to the vector functional units, and (ii) partially (or fully) inhibits the optimizations when full vectorization will decrease performance. This results in better resource usage and allows for software pipelining with shorter initiation intervals. Although our techniques complement statically scheduled machines most naturally, we believe they are applicable to any architecture that tightly integrates support for ILP and data parallelism. An important aspect of the proposed methodology is its ability to manage explicit communication of operands between vector and scalar instructions. Our methodology also allows for a natural handling of misaligned vector memory operations. For architectures that provide hardware support for misaligned references, software pipelining effectively hides the latency of these potentially expensive instructions. When explicit alignment is required in software, our algorithm accounts for these extra costs and vectorizes only when it is profitable. Finally, our heuristic can take advantage of alignment information where it is available. We evaluate our methodology using several DSP and SPEC FP benchmarks. 
Compared to software pipelining, our approach is able to achieve an average speedup of 1.30× and 1.18× for the two benchmark sets, respectively.
Evaluating Compiler Technology for Control-Flow Optimizations for Multimedia Extension Architectures
Cited by 4 (1 self)
This paper addresses how to automatically generate code for multimedia extension architectures in the presence of conditionals. We evaluate the costs and benefits of exploiting branches on the aggregate condition codes associated with the fields of a superword (an aggregate object larger than a machine word) such as the branch-on-any instruction of the AltiVec. Branch-on-superword-condition-codes (BOSCC) instructions allow fast detection of aggregate conditions, an optimization opportunity often found in multimedia applications. This paper presents compiler analyses and techniques for generating efficient parallel code using BOSCC instructions. We evaluate our approach, which has been implemented in the SUIF compiler, through a set of experiments with multimedia benchmarks, and compare it with the default approach previously implemented in our compiler. Our experimental results show that using BOSCC instructions can result in better performance for applications where the aggregate condition codes of a superword often evaluate to the same value.
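The idea of branching on an aggregate condition can be illustrated with a small model. The sketch below is not the SUIF implementation the abstract refers to; it assumes a superword of four fields, uses a clamping loop as a hypothetical workload, and models a branch-on-any test with Python's `any`. When no field in a superword satisfies the condition, the whole conditional body is skipped; otherwise the body runs under a per-field mask.

```python
WIDTH = 4  # assumed number of fields per superword


def clamp_scalar(values, limit):
    """Reference scalar loop: clamp every element above limit."""
    out = list(values)
    for i in range(len(out)):
        if out[i] > limit:
            out[i] = limit
    return out


def clamp_boscc(values, limit):
    """Superword version; returns (result, number of superwords skipped)."""
    assert len(values) % WIDTH == 0, "sketch assumes whole superwords"
    out = list(values)
    skipped = 0
    for base in range(0, len(out), WIDTH):
        lanes = out[base:base + WIDTH]
        mask = [x > limit for x in lanes]  # per-field comparison
        if not any(mask):                  # models a branch-on-any test
            skipped += 1                   # body bypassed for this superword
            continue
        # Masked merge (select) for superwords where some field is true.
        out[base:base + WIDTH] = [limit if m else x for m, x in zip(mask, lanes)]
    return out, skipped


data = [1, 2, 3, 4, 9, 2, 3, 4, 1, 1, 1, 1]
result, skipped = clamp_boscc(data, 5)
print(result, skipped)
```

In this model the benefit grows with the number of skipped superwords, which matches the abstract's observation that the approach pays off when the fields of a superword often evaluate to the same value.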
SIMD Programming by Expansion
AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
Since their advent 30 years ago, single-instruction multiple-data (SIMD) functional units continue to provide an opportunity for high performance at a low hardware cost. However, a general consensus is that only a class of well-formed computations is suitable for SIMD execution. We believe that the boundary of the class should be pushed so that more applications can get the benefit of SIMD parallelism. Our goal is to provide programmers with tools that will allow easier access to SIMD functional units. In this paper, we describe a new method to generate SIMD instructions automatically. Unlike the current approaches that target either loops or basic blocks, our approach targets a whole function. Instead of trying to keep the sequential execution semantics, we semantically transform the given input function by replacing the operators and operands with their SIMD counterparts. The output functions generated this way take vector arguments and return a vector value. We have implemented the new method in a compiler, called EXPAND, and show how to use it for user applications. To demonstrate the effectiveness of the new method, we apply the EXPAND compiler to 12 GNU math library intrinsic functions. When measured on a PowerPC G5, the transformed output codes achieve speedups ranging from 2.05 to 11.37 over the scalar baseline.
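The whole-function transformation described above can be pictured with a toy example. This is a hand-written sketch, not output of the EXPAND compiler: the input function `relu_scaled`, the helper names (`vsplat`, `vgt`, `vmul`, `vselect`), and the use of a select to replace the branch are all assumptions made for illustration. Each scalar operator and operand is replaced by a lane-wise counterpart, so the expanded function takes vector arguments and returns a vector value.

```python
def relu_scaled(x: float, k: float) -> float:
    """Hypothetical scalar input function."""
    if x > 0.0:
        return x * k
    return 0.0


# Lane-wise stand-ins for SIMD operations (lists model vector registers).
def vsplat(constant, lanes):
    return [constant] * lanes


def vgt(a, b):
    return [x > y for x, y in zip(a, b)]


def vmul(a, b):
    return [x * y for x, y in zip(a, b)]


def vselect(mask, if_true, if_false):
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def relu_scaled_expanded(x, k):
    """Expanded counterpart: vector arguments in, vector value out."""
    zero = vsplat(0.0, len(x))   # scalar constant becomes a splat
    mask = vgt(x, zero)          # scalar comparison becomes a lane mask
    return vselect(mask, vmul(x, k), zero)  # branch becomes a select


xs = [-1.0, 2.0, 0.0, 3.5]
ks = [2.0, 2.0, 2.0, 2.0]
print(relu_scaled_expanded(xs, ks))
```

Both arms of the conditional are evaluated for every lane here, so this style only suits functions whose arms are free of side effects.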
Polyhedral-Model Guided Loop-Nest Auto-Vectorization
- Author manuscript, published in "The 18th International Conference on Parallel Architectures and Compilation Techniques (2009)", 2011
Optimizing compilers apply numerous interdependent optimizations, leading to the notoriously difficult phase-ordering problem: that of deciding which transformations to apply and in which order. Fortunately, new infrastructures such as the polyhedral compilation framework host a variety of transformations, facilitating the efficient exploration and configuration of multiple transformation sequences. Many powerful optimizations, however, remain external to the polyhedral framework, including vectorization. The low-level, target-specific aspects of vectorization for fine-grain SIMD have so far excluded it from being part of the polyhedral framework. In this paper we examine the interactions between loop transformations of the polyhedral framework and subsequent vectorization. We model the performance impact of the different loop transformations and vectorization strategies, and then show how this cost model can be integrated seamlessly into the polyhedral representation. This predictive modelling facilitates efficient exploration and educated decision making to best apply various polyhedral loop transformations while considering the subsequent effects of different vectorization schemes. Our work demonstrates the feasibility and benefit of tuning the polyhedral model in the context of vectorization. Experimental results confirm that our model has accurate predictions, providing speedups of over 2.0x on average over traditional innermost-loop vectorization on PowerPC970 and Cell-SPU SIMD platforms.

