Enhanced Code Compression for Embedded RISC Processors
, 1999
Cited by 89 (2 self)
This paper explores compiler techniques for reducing the memory needed to load and run program executables. In embedded systems, where economic incentives to reduce both RAM and ROM are strong, the size of compiled code is increasingly important. Similarly, in mobile and network computing, the need to transmit an executable before running it places a premium on code size. Our work focuses on reducing the size of a program's code segment, using pattern-matching techniques to identify and coalesce repeated instruction sequences. In contrast to other methods, our framework preserves the ability to run program executables directly, without an intervening decompression stage. Our compression framework is integrated into an industrial-strength optimizing compiler, which allows us to explore the interaction between code compression and classical code optimization techniques, and requires that we contend with the difficulties of compressing previously optimized code. The specific contributions in this paper include a comprehensive experimental evaluation of code compression for a RISC-like architecture, a more powerful pattern-matching scheme for improved identification of repeated code fragments, and a new form of profile-driven code compression that reduces the speed penalty arising from compression.
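As a rough illustration of what "identifying repeated instruction sequences" can mean, the sketch below indexes every fixed-length window of an instruction list and reports the windows that occur more than once. It is not the paper's pattern-matching scheme: the function name `repeated_sequences`, the exact-text comparison, and the MIPS-like sample instructions are all assumptions made for this example, and a real compressor would also have to handle overlapping matches, register renaming, and branch targets.

```python
from collections import defaultdict

def repeated_sequences(instructions, length):
    """Map each window of `length` instructions to the start indices where
    it occurs, keeping only windows that occur more than once.
    Hypothetical helper: matches on exact instruction text only."""
    positions = defaultdict(list)
    for start in range(len(instructions) - length + 1):
        window = tuple(instructions[start:start + length])
        positions[window].append(start)
    return {seq: starts for seq, starts in positions.items() if len(starts) > 1}

# Made-up instruction stream with one three-instruction sequence repeated.
code = [
    "lw r1, 0(r2)",
    "addi r1, r1, 1",
    "sw r1, 0(r2)",
    "mul r3, r4, r5",
    "lw r1, 0(r2)",
    "addi r1, r1, 1",
    "sw r1, 0(r2)",
]
repeats = repeated_sequences(code, 3)
for seq, starts in repeats.items():
    print(starts, seq)
```

Each reported sequence is a candidate for being moved into a single shared copy that the original sites call or branch to, which is the general idea behind coalescing repeated code.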
A Scheduler-Sensitive Global Register Allocator
- IN SUPERCOMPUTING '93 PROCEEDINGS
, 1993
Cited by 28 (4 self)
Compile-time reordering of machine-level instructions has been very successful at achieving large increases in performance of programs on machines offering fine-grained parallelism. However, because of the interdependences between instruction scheduling and register allocation, it is not clear which of these two phases of the compiler should run first to generate the most efficient final code. In this paper, we describe our investigation into slight modifications to key phases of a successful global register allocator to create a scheduler-sensitive register allocator, which is then followed by an "off-the-shelf" instruction scheduler. Our experimental studies reveal that this approach achieves speedups comparable to those of previous cooperative approaches, and increasingly better as the number of available registers grows, without the complexities of those approaches.
Region-Based Compilation
, 1996
Cited by 18 (1 self)
The increasing amount of instruction-level parallelism (ILP) required to fully utilize high issue-rate processors has forced the compiler to perform more aggressive analysis, optimization, parallelization and scheduling on the input programs. Yet, the compiler designer must scale back the use of aggressive transformations in order to contain compile time and memory usage. The root of the problem lies in the function-oriented framework assumed in conventional compilers. Traditionally the compilation process has been built using the function as a compilation unit, because the function provides a convenient partition of the program. However, the size and contents of a function may not provide the best environment for aggressive analysis and optimization. This dissertation presents a technique in which the compiler is allowed to repartition the program into more desirable compilation units, called regions. Placing the compiler in control of the size and contents of the compilation unit reduces the importance of the algorithmic complexity of the applied transformations, allowing more aggressive transformations to be applied while reducing compilation time. The region concept has been traditionally applied within an ILP compiler only in the context of code scheduling. This dissertation proposes extending the concept of region partitioning to
Load/Store Range Analysis for Global Register Allocation
- Proc. of the SIGPLAN Conference on Programming Language Design and Implementation
, 1994
Cited by 18 (0 self)
Live range splitting techniques improve global register allocation by splitting the live ranges of variables into segments that are individually allocated registers. Load/store range analysis is a new technique for live range splitting that is based on reaching definition and live variable analyses. Our analysis localizes the profits and the register requirements of every access to every variable to provide a fine granularity of candidates for register allocation. Experiments on a suite of C and FORTRAN benchmark programs show that a graph coloring register allocator operating on load/store ranges often provides better allocations than the same allocator operating on live ranges. Experimental results also show that the computational cost of using load/store ranges for register allocation is moderately more than the cost of using live ranges.
1 Introduction
Register allocation maps variables in an intermediate language program to either registers or memory locations in order to minimiz...
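The abstract builds on live variable analysis, so a small sketch of the textbook version of that analysis may help. The code below runs the standard backward fixed-point iteration over a control-flow graph; it does not implement load/store range analysis itself. The function name `live_variables`, the block names, and the dictionary layout are assumptions chosen for this example.

```python
def live_variables(blocks, successors):
    """Classic iterative live-variable analysis (a sketch, not the paper's
    load/store range analysis).
    blocks: {name: (use_set, def_set)}, where use_set holds variables read
            before any write in the block.
    successors: {name: [successor names]}.
    Returns (live_in, live_out) dictionaries of sets."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            out = set()
            for s in successors.get(b, []):
                out |= live_in[s]
            new_in = set(use) | (out - set(defs))
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out

# Hypothetical three-block loop: entry initialises i and s, the loop body
# updates both, and the exit block reads s.
blocks = {
    "entry": (set(), {"i", "s"}),
    "loop": ({"i", "s"}, {"i", "s"}),
    "exit": ({"s"}, set()),
}
successors = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = live_variables(blocks, successors)
print(live_in, live_out)
```

A live range splitter would start from sets like these and then decide where a variable's range can be cut into separately allocatable segments.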
Register Pressure Sensitive Redundancy Elimination
- 8th Int’l. Conf. on Compiler Construction
, 1999
Cited by 14 (1 self)
Redundancy elimination optimizations avoid repeated computation of the same value by computing the value once, saving it in a temporary, and reusing the value from the temporary when it is needed again. Examples of redundancy elimination optimizations include common subexpression elimination, loop invariant code motion and partial redundancy elimination. We demonstrate that the introduction of temporaries to save computed values can result in a significant increase in register pressure. An increase in register pressure may in turn trigger generation of spill code which can more than offset the gains derived from redundancy elimination. While current techniques minimize increases in register pressure, to avoid spill code generation it is instead necessary to ensure that register pressure does not exceed the number of available registers. In this paper we develop a redundancy elimination algorithm that is sensitive to register pressure: our novel technique first sets upper limits on al...
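The claim that a saved temporary can raise register pressure is easy to see on a toy example. The sketch below measures pressure as the largest number of simultaneously live values in straight-line code, then compares a sequence that recomputes `a + b` with one that keeps the result in a temporary `t`. This is only an illustration under assumed conventions (the `(dest, sources)` tuple encoding, the function name `max_register_pressure`, and the sample code are invented here); it is not the paper's algorithm.

```python
def max_register_pressure(instructions, live_out=()):
    """Largest number of simultaneously live values in straight-line code.
    instructions: list of (dest, sources) tuples, in program order.
    live_out: values still needed after the last instruction.
    Walks backward, maintaining the live set before each instruction."""
    live = set(live_out)
    pressure = len(live)
    for dest, sources in reversed(instructions):
        live.discard(dest)      # dest is not live before its definition
        live.update(sources)    # operands must be live before the use
        pressure = max(pressure, len(live))
    return pressure

# Original code: a + b is computed twice (as x and as y).
before = [
    ("x", ("a", "b")),
    ("p", ("x", "c")),
    ("m", ()),
    ("n", ()),
    ("q", ("m", "n")),
    ("r", ("p", "q")),
    ("y", ("a", "b")),
    ("z", ("y", "r")),
]
# After common subexpression elimination: t holds a + b across the middle.
after = [
    ("t", ("a", "b")),
    ("p", ("t", "c")),
    ("m", ()),
    ("n", ()),
    ("q", ("m", "n")),
    ("r", ("p", "q")),
    ("z", ("t", "r")),
]
pressure_before = max_register_pressure(before, live_out=("z", "a", "b"))
pressure_after = max_register_pressure(after, live_out=("z", "a", "b"))
print(pressure_before, pressure_after)  # 5 6
```

Removing one addition costs one extra live value at the peak here; on a machine with five available registers, the optimized version would be the one that spills.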
Minimum Register Instruction Sequencing to Reduce Register Spills in Out-of-Order Issue Superscalar Architectures
- IEEE Transactions on Computers
, 2003
Cited by 8 (0 self)
Abstract — In this paper we address the problem of generating an optimal
A Systematic Approach to Delivering Instruction-Level Parallelism in EPIC Systems
, 2005
Cited by 3 (0 self)
Computer systems designed under the explicitly parallel instruction computing (EPIC) paradigm rely extensively on compiler technology to deliver the instruction-level parallelism (ILP) required for them to achieve high levels of performance. While manifold techniques have been proposed in the literature for delivering such parallelism, this dissertation is unique in integrating and applying a comprehensive suite of techniques, embodied in the IMPACT Research Compiler, to a concrete system comprising the SPEC CINT2000 benchmarks and the Intel Itanium 2 platform. These techniques include advanced pointer analysis, aggressive cross-file procedure inlining, targeted region formation, profile-guided optimizations, and, most importantly, aggressive and pervasive use of predication and control speculation. The collective effect of these techniques is evaluated with real-system measurements, showing them to achieve a 1.20 average (up to 1.59) speedup relative to classically optimized code and a 1.70 average (up to 2.51) speedup relative to code compiled with the GNU GCC compiler. Achieving these results in the real-machine environment required advances in region formation heuristics, optimization, and speculation methods. Modern
Minimum Register Instruction Scheduling: A New Approach for Dynamic Instruction Issue Processors
- In Proc. of the Twelfth International Workshop on Languages and Compilers for Parallel Computing
, 1999
Cited by 2 (2 self)
Modern superscalar architectures with dynamic scheduling and register renaming capabilities have introduced subtle but important changes into the tradeoffs between compile-time register allocation and instruction scheduling. In particular, it is perhaps not wise to increase the degree of parallelism of the static instruction schedule at the expense of excessive register pressure which may result in additional spill code. To the contrary, it may even be beneficial to reduce the register pressure at the expense of constraining the degree of parallelism of the static instruction schedule. This leads to the following interesting problem: given a data dependence graph (DDG) G, can we derive a schedule S for G that uses the least number of registers? In this paper, we present a heuristic approach to compute the near-optimal number of registers required for a DDG G (under all possible legal schedules). We propose an extended list-scheduling algorithm which uses the above number...
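To make the problem statement concrete, the sketch below computes the register requirement of one linear schedule of a small DDG and then finds the minimum by brute force over all legal orderings. The paper proposes a heuristic; this exhaustive search is a deliberately naive stand-in that is only feasible for tiny graphs. The function names, the edge-set encoding, and the seven-node example graph are assumptions for illustration, and the model assumes each node produces one value that stays live until its last consumer issues.

```python
from itertools import permutations

def registers_needed(schedule, edges):
    """Peak number of live values for one linear schedule of a DDG.
    edges: set of (producer, consumer) pairs. A value is counted as live
    from the step that produces it up to, but not including, the step of
    its last consumer (the consumer may reuse the register)."""
    position = {node: i for i, node in enumerate(schedule)}
    last_use = {}
    for producer, consumer in edges:
        last_use[producer] = max(last_use.get(producer, -1), position[consumer])
    peak = 0
    for step in range(len(schedule)):
        live = sum(1 for n, end in last_use.items() if position[n] <= step < end)
        peak = max(peak, live)
    return peak

def is_legal(schedule, edges):
    """A schedule is legal if every producer precedes its consumers."""
    position = {node: i for i, node in enumerate(schedule)}
    return all(position[p] < position[c] for p, c in edges)

def min_registers(nodes, edges):
    """Exhaustive minimum over all legal schedules (exponential time)."""
    return min(registers_needed(s, edges)
               for s in permutations(nodes) if is_legal(s, edges))

# Hypothetical DDG: e = a + b, f = c + d, g = e + f.
nodes = ["a", "b", "c", "d", "e", "f", "g"]
edges = {("a", "e"), ("b", "e"), ("c", "f"), ("d", "f"), ("e", "g"), ("f", "g")}

eager = registers_needed(["a", "b", "c", "d", "e", "f", "g"], edges)
best = min_registers(nodes, edges)
print(eager, best)  # 4 3
```

Issuing all four loads first exposes the most parallelism but needs four registers, while interleaving each addition with its loads needs three, which is the tradeoff the abstract describes.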
Removing Communications in Clustered Microarchitectures Through Instruction Replication
Cited by 1 (0 self)
The need to communicate values between clusters can result in a significant performance loss for clustered microarchitectures. In this work, we describe an optimization technique that removes communications by selectively replicating an appropriate set of instructions. Instruction replication is done carefully because it might degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo-scheduling algorithm. Though this algorithm has been proved to be very effective at reducing communications, results show that the number of communications can be further decreased by around one-third through replication, which results in a significant speedup. IPC is increased by 25% on average for a four-cluster microarchitecture and by as much as 70% for selected programs. We also show that replicating appropriate sets of instructions is more effective than doubling the intercluster connection network bandwidth.

