Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture
, 2006
Cited by 26 (1 self)
In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures
, 2002
Cited by 21 (3 self)
In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations to exploit reuse in cache, the compiler must also manage replacement, and thus, explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3 to 2.8X on the 6 programs as compared to using SLP alone, and we eliminate the majority of memory accesses.
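The idea of treating superword registers as a compiler-controlled cache can be illustrated with a toy model. The sketch below is not the paper's algorithm; it assumes a hypothetical superword width of 4 elements and a hypothetical loop computing `out[i] = a[i] + a[i + 4]`, and it only counts superword loads with and without carrying a loaded superword across iterations.

```python
W = 4  # assumed superword width in elements (hypothetical)


def load_superword(mem, base, counter):
    """Model one superword load from memory and count it."""
    counter[0] += 1
    return mem[base:base + W]


def shifted_sum_naive(a, n):
    """Reload both superwords in every vector iteration."""
    loads = [0]
    out = []
    for i in range(0, n, W):
        lo = load_superword(a, i, loads)
        hi = load_superword(a, i + W, loads)
        out.extend(x + y for x, y in zip(lo, hi))
    return out, loads[0]


def shifted_sum_reuse(a, n):
    """Keep the 'hi' superword in a register and reuse it as the next 'lo'."""
    loads = [0]
    out = []
    lo = load_superword(a, 0, loads)
    for i in range(0, n, W):
        hi = load_superword(a, i + W, loads)
        out.extend(x + y for x, y in zip(lo, hi))
        lo = hi  # register-to-register move instead of a memory access
    return out, loads[0]


a = list(range(20))  # n + W elements so the last 'hi' load stays in bounds
n = 16
naive_out, naive_loads = shifted_sum_naive(a, n)
reuse_out, reuse_loads = shifted_sum_reuse(a, n)
print(naive_loads, reuse_loads)
```

Both versions produce the same output, but the second issues one load per iteration plus one initial load. In a real compiler the reuse would also require explicit register naming and replacement decisions, as the abstract notes.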
Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements
- IEEE Transactions on Computers
, 2003
Cited by 15 (4 self)
Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that the attributes of current general-purpose processors enhanced with SIMD extensions do not match very well with the access patterns and loop structures of media programs. We find that 75-85% of the dynamic instructions in the processor instruction stream are supporting instructions necessary to feed the SIMD execution units rather than true/useful computations, resulting in the underutilization of SIMD execution units (only 1-12% of the peak SIMD execution units' throughput is achieved). Rather than focusing on exploiting more data level parallelism (DLP), in this paper we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze provides better performance than a 16-way processor with current SIMD extensions. In the case of application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration is achieved at a 10% increase in area required by MMX and SSE extensions (0.3% increase in overall chip area) and 1% of total processor power consumption.
A Preliminary Study on the Vectorization of Multimedia Applications for Multimedia Extensions
- In 16th International Workshop of Languages and Compilers for Parallel Computing
, 2003
Cited by 12 (1 self)
In 1994, the first multimedia extension, MAX-1, was introduced to general-purpose processors by HP. Almost ten years have passed, yet the present means of accessing the computing power of multimedia extensions are still limited to mostly assembly programming and the use of system libraries and intrinsic functions. Because of the similarity between multimedia extensions and vector processors, it is believed that traditional vectorization can be used to compile for multimedia extensions. Can traditional vectorization effectively vectorize for multimedia extensions? If not, what additional techniques are needed? This paper tries to answer these two questions. Based on a code study of the Berkeley Multimedia Workload, we identify several new challenges that arise in vectorizing for multimedia extensions and provide some solutions to these challenges.
Efficient Utilization of SIMD Extensions
- IEEE PROCEEDINGS SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND PLATFORM ADAPTATION
, 2003
Cited by 11 (6 self)
This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel’s SSE family, AMD’s 3DNow!, Motorola’s AltiVec, and IBM’s BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine’s general purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and thus inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes (i) symbolic vectorization of DSP transforms, (ii) straight-line code vectorization for numerical kernels, and (iii) compiler backends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speed-ups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.
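To give a rough feel for what straight-line code vectorization for two-way vector extensions involves, the following sketch greedily pairs adjacent, independent scalar statements with the same opcode into one two-way vector operation. This is a deliberately simplified illustration, not the method used in the paper; the statement format `(dest, op, (src1, src2))`, the `v` opcode prefix, and all variable names are assumptions made for the example.

```python
def pair_two_way(stmts):
    """Greedily fuse adjacent independent statements with equal opcodes.

    Each scalar statement is (dest, op, (src1, src2)). A fused statement is
    ("v" + op, (dest0, dest1), ((src1_0, src1_1), (src2_0, src2_1))).
    """
    packed = []
    i = 0
    while i < len(stmts):
        if i + 1 < len(stmts):
            d0, op0, s0 = stmts[i]
            d1, op1, s1 = stmts[i + 1]
            # Conservative independence check between the two statements.
            independent = d0 not in s1 and d1 not in s0 and d0 != d1
            if op0 == op1 and independent:
                packed.append(("v" + op0, (d0, d1), tuple(zip(s0, s1))))
                i += 2
                continue
        packed.append(stmts[i])
        i += 1
    return packed


scalar_code = [
    ("t0", "add", ("a0", "b0")),
    ("t1", "add", ("a1", "b1")),
    ("t2", "mul", ("t0", "c0")),
    ("t3", "mul", ("t1", "c1")),
    ("t4", "sub", ("t2", "t3")),
]
vector_code = pair_two_way(scalar_code)
for stmt in vector_code:
    print(stmt)
```

Here five scalar statements become two vector statements plus one remaining scalar statement. A production vectorizer would also have to search over non-adjacent pairings and account for the cost of moving operands into and out of vector registers.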
Energy aware Compilation for DSPs with SIMD instructions
, 2002
Cited by 9 (1 self)
The growing use of digital signal processors (DSPs) in embedded systems makes the use of optimizing compilers supporting special hardware features necessary. In this paper we present compiler optimizations with the aim of minimizing energy consumption of embedded applications: This comprises loop optimizations for exploitation of SIMD instructions and zero overhead hardware loops in order to increase performance and in this way to decrease the energy consumption. In addition, we use a phase coupled code generator (GCG) based on a genetic algorithm which is capable of performing an energy aware instruction selection and scheduling. Energy aware compilation is done with respect to an instruction level energy cost model which is integrated into our code generator and simulator. Experimental results for several benchmarks show the effectiveness of our approach.
A Retargetable Preprocessor for Multimedia Instructions
- PROC. WORKSHOP ON COMPILERS FOR PARALLEL COMPUTERS
, 2001
Cited by 6 (0 self)
Demand for more computation power in the media computing domain has led to multimedia extensions in the instruction sets of modern processors. This has resulted in the implementation of new instructions that handle short data types and exploit subword parallelism to improve performance. Compiler developments that aim to provide some degree of support for these instructions are now emerging and usually fall into three categories: vectorization, idiom recognition, and code generation. Unfortunately, there is no consensus among manufacturers on the definition of multimedia instructions. Therefore, very few works have addressed the general issue of exploiting multimedia instructions from high-level languages. In this paper
SIMD vectorization of straight line FFT code
- Proceedings of the Euro-Par ’03 Conference on Parallel and Distributed Computing LNCS 2790
, 2003
Cited by 5 (4 self)
This paper presents compiler technology that targets general purpose microprocessors augmented with SIMD execution units for exploiting data level parallelism. FFT kernels are accelerated by automatically vectorizing blocks of straight line code for processors featuring two-way short vector SIMD extensions like AMD’s 3DNow! and Intel’s SSE2. Additionally, a special compiler backend is introduced which is able to (i) utilize particular code properties, (ii) generate optimized address computation, and (iii) apply specialized register allocation and instruction scheduling. Experiments show that automatic SIMD vectorization can achieve performance that is comparable to the optimal hand-generated code for FFT kernels. The newly developed methods have been integrated into the codelet generator of FFTW and successfully vectorized complicated code like real-to-halfcomplex non-power-of-two FFT kernels. The floating-point performance of FFTW’s scalar version has been more than doubled, resulting in the fastest FFT implementation to date.
Exploiting Superword-Level Locality in Multimedia Extension Architectures
- Journal of Instruction Level Parallelism (JILP)
, 2003
Cited by 4 (2 self)
In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations to exploit reuse in cache, the compiler must also manage replacement, and thus, explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3 to 3.1X on the 6 programs as compared to using SLP alone, and we eliminate the majority of memory accesses.
Exploiting vector parallelism in software pipelined loops
- In Proc. of the 38th Annual International Symposium on Microarchitecture
, 2005
Cited by 4 (0 self)
An emerging trend in processor design is the incorporation of short vector instructions into the ISA. In fact, vector extensions have appeared in most general-purpose microprocessors. To utilize these instructions, traditional vectorization technology can be used to identify and exploit data parallelism. In contrast, efficient use of a processor’s scalar resources is typically achieved through ILP techniques such as software pipelining. In order to attain the best performance, it is necessary to utilize both sets of resources. This paper presents a novel approach for exploiting vector parallelism in a software pipelined loop. At its core is a method for judiciously partitioning operations between vector and scalar resources. The proposed algorithm (i) lowers the burden on the scalar resources by offloading computation to the vector functional units, and (ii) partially (or fully) inhibits the optimizations when full vectorization will decrease performance. This results in better resource usage and allows for software pipelining with shorter initiation intervals. Although our techniques complement statically scheduled machines most naturally, we believe they are applicable to any architecture that tightly integrates support for ILP and data parallelism. An important aspect of the proposed methodology is its ability to manage explicit communication of operands between vector and scalar instructions. Our methodology also allows for a natural handling of misaligned vector memory operations. For architectures that provide hardware support for misaligned references, software pipelining effectively hides the latency of these potentially expensive instructions. When explicit alignment is required in software, our algorithm accounts for these extra costs and vectorizes only when it is profitable. Finally, our heuristic can take advantage of alignment information where it is available. We evaluate our methodology using several DSP and SPEC FP benchmarks. 
Compared to software pipelining, our approach is able to achieve an average speedup of 1.30× and 1.18× for the two benchmark sets, respectively.
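Why partitioning operations between scalar and vector resources can shorten the initiation interval can be seen from the standard resource-constrained lower bound used in modulo scheduling, ResMII = max over resource classes of ceil(operations / units). The sketch below is an illustration under strong assumptions, not the paper's algorithm: the machine (2 scalar units, 1 vector unit of width 2) and the loop (8 operations per iteration, all vectorizable) are hypothetical, and operand transfer costs, alignment costs, and recurrence constraints are ignored.

```python
from math import ceil


def res_mii(scalar_ops, vector_ops, scalar_units, vector_units):
    """Resource-constrained lower bound on the initiation interval."""
    return max(ceil(scalar_ops / scalar_units),
               ceil(vector_ops / vector_units))


def best_partition(ops, width, scalar_units, vector_units):
    """Try every count k of vectorized operations; return (k, ResMII).

    The loop body is considered unrolled by `width`, so each operation left
    scalar appears `width` times, while each vectorized one appears once.
    """
    best = None
    for k in range(ops + 1):
        ii = res_mii(width * (ops - k), k, scalar_units, vector_units)
        if best is None or ii < best[1]:
            best = (k, ii)
    return best


# Hypothetical loop and machine parameters.
OPS, WIDTH, SCALAR_UNITS, VECTOR_UNITS = 8, 2, 2, 1

all_scalar = res_mii(WIDTH * OPS, 0, SCALAR_UNITS, VECTOR_UNITS)
all_vector = res_mii(0, OPS, SCALAR_UNITS, VECTOR_UNITS)
best = best_partition(OPS, WIDTH, SCALAR_UNITS, VECTOR_UNITS)
print(all_scalar, all_vector, best)
```

Under these assumptions, running everything on the scalar units or everything on the vector unit both give a bound of 8 cycles per two iterations, while vectorizing half of the operations gives 4. Once communication and alignment costs are included, the best split would shift, which is the trade-off the abstract describes.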

