Design and Implementation of a Lightweight Dynamic Optimization System
- Journal of Instruction-Level Parallelism
, 2004
Cited by 36 (7 self)
Many opportunities exist to improve micro-architectural performance due to performance events that are difficult to optimize at static compile time. Cache misses and branch mis-prediction patterns may vary for different micro-architectures using different inputs.
From Sequences of Dependent Instructions to Functions: An Approach for Improving Performance without ILP or Speculation
- ISCA'04
, 2004
Cited by 23 (2 self)
In this article, we present an approach for improving the performance of sequences of dependent instructions. We observe that many sequences of instructions can be interpreted as functions. Unlike sequences of instructions, functions can be translated into very fast but exponentially costly two-level combinational circuits. We present an approach that exploits this principle, speeds up programs thanks to circuit-level parallelism/redundancy, but avoids the exponential costs. We analyze the potential of this approach, and then we propose an implementation that consists of a superscalar processor with a large specific functional unit associated with specific back-end transformations. The performance of the SpecInt2000 benchmarks and selected programs from the Olden and MiBench benchmark suites improves on average from 2.4% to 12% depending on the latency of the functional units, and up to 39.6%; more precisely, the performance of optimized code sections improves on average from 3.5% to 19%, and up to 49%.
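The trade-off described in this abstract (a dependent sequence is slow because each step waits for the previous one, while the equivalent function can be evaluated in one step at exponential cost) can be illustrated with a software analogy. The sketch below is not the paper's hardware design: the function `dependent_sequence`, the 4-bit operand width, and the helper `build_table` are all hypothetical, and a lookup table stands in for a two-level circuit.

```python
from itertools import product

def dependent_sequence(a: int, b: int) -> int:
    """Three dependent 4-bit operations; each step needs the previous result."""
    t1 = (a + b) & 0xF
    t2 = t1 ^ a
    t3 = (t2 << 1) & 0xF
    return t3

def build_table(fn, n_inputs: int, bits: int) -> dict:
    """Tabulate fn over every input combination.

    The table has 2 ** (n_inputs * bits) entries, which is the exponential
    cost the abstract refers to (assumption: a table models the circuit).
    """
    values = range(1 << bits)
    return {args: fn(*args) for args in product(values, repeat=n_inputs)}

# One lookup replaces the three-step dependence chain.
TABLE = build_table(dependent_sequence, n_inputs=2, bits=4)
```

With two 4-bit inputs the table holds 256 entries; widening each input to 32 bits would require 2^64 entries, which is why an approach of this kind has to avoid naive tabulation.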
Performance Characterization of a Hardware Mechanism for Dynamic Optimization
- In 34th International Symposium on Microarchitecture
, 2001
Cited by 23 (3 self)
We evaluate the rePLay microarchitecture as a means for reducing application execution time by facilitating dynamic optimization. The framework contains a programmable optimization engine coupled with a hardware-based recovery mechanism. The optimization engine enables the dynamic optimizer to run concurrently with program execution. The recovery mechanism enables the optimizer to make speculative optimizations without requiring recovery code.
LLVA: A Low-level Virtual Instruction Set Architecture
- IN MICRO-36
, 2003
Cited by 20 (3 self)
A virtual instruction set architecture (V-ISA) implemented via a processor-specific software translation layer can provide great flexibility to processor designers. Recent examples such as Crusoe and DAISY, however, have used existing hardware instruction sets as virtual ISAs, which complicates translation and optimization. In fact, there has been little research on specific designs for a virtual ISA for processors. This paper proposes a novel virtual ISA (LLVA) and a translation strategy for implementing it on arbitrary hardware. The instruction set is typed, uses an infinite virtual register set in Static Single Assignment form, and provides explicit control-flow and dataflow information, and yet uses low-level operations closely matched to traditional hardware. It includes novel mechanisms to allow more flexible optimization of native code, including a flexible exception model and minor constraints on self-modifying code. We propose a translation strategy that enables offline translation and transparent offline caching of native code and profile information, while remaining completely OS-independent. It also supports optimizations directly on the representation at install-time, runtime, and offline between executions. We show experimentally that the virtual ISA is compact, it is closely matched to ordinary hardware instruction sets, and permits very fast code generation, yet has enough high-level information to permit sophisticated program analyses and optimizations.
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication
, 2004
Cited by 20 (3 self)
In the modern era of wire-dominated architectures, specific effort must be made to reduce needless communication within out-of-order pipelines while still maintaining binary compatibility. To ease pressure on highly-connected elements such as the issue logic and bypass network, we propose the dynamic detection and speculative execution of instruction strands--linear chains of dependent instructions without intermediate fan-out. The hardware required for detecting these chains is simple and resides off the critical path of the pipeline, and the execution targets are the normal ALUs with a self-bypass mode. By collapsing these strings of dependencies into atomic macro-instructions, the efficiency of the issue queue and reorder buffer can be increased. Our results show that over 25% of all dynamic ALU instructions can be grouped, decreasing both the average reorder buffer occupancy and issue queue occupancy by over a third. Additionally, these strands have several properties which make them amenable to simple performance optimizations. Our experiments show average IPC increases of 17% on a four-wide machine and 20% on an eight-wide machine in Spec2000int and Mediabench applications. Finally, strands ease the IPC penalties of multicycle issue and bypass by reducing dependency pressures, providing opportunity for clock frequency gains as well.
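The strand definition in this abstract (a linear chain of dependent instructions without intermediate fan-out) can be sketched in software. This is a simplified offline illustration, not the paper's hardware detector: the `Instr` type and `find_strands` function are hypothetical, every instruction is treated as an ALU operation, and a value is assumed dead at the end of the window unless a later instruction in the window reads it.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str      # register written
    srcs: tuple    # registers read

def find_strands(instrs):
    """Return lists of instruction indices forming chains without fan-out."""
    # consumers[i] = indices of instructions that read the value produced by i
    consumers = [[] for _ in instrs]
    last_writer = {}
    for i, ins in enumerate(instrs):
        for src in set(ins.srcs):
            if src in last_writer:
                consumers[last_writer[src]].append(i)
        last_writer[ins.dest] = i

    # Link a producer to its consumer only when that consumer is the sole
    # reader (no fan-out) and has not already joined another chain.
    next_in_chain, has_pred = {}, set()
    for i, cons in enumerate(consumers):
        if len(cons) == 1 and cons[0] not in has_pred:
            next_in_chain[i] = cons[0]
            has_pred.add(cons[0])

    # Walk each chain from its head.
    strands = []
    for i in range(len(instrs)):
        if i in has_pred or i not in next_in_chain:
            continue
        chain = [i]
        while chain[-1] in next_in_chain:
            chain.append(next_in_chain[chain[-1]])
        strands.append(chain)
    return strands

program = [
    Instr("r1", ("r0", "r9")),   # 0
    Instr("r2", ("r1",)),        # 1: only reader of r1
    Instr("r3", ("r2", "r8")),   # 2: only reader of r2
    Instr("r5", ("r3", "r7")),   # 3: r3 fans out to 3 and 4
    Instr("r6", ("r3", "r7")),   # 4
]
strands = find_strands(program)
```

In this example instructions 0, 1, and 2 form one strand; the chain stops at instruction 2 because its result has two readers.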
Reno: A rename-based instruction optimizer
- In Proc. 32nd International Symposium on Computer Architecture
, 2005
Cited by 20 (7 self)
The effectiveness of static code optimizations, including static optimizations performed “just-in-time”, is limited by some basic constraints: (i) a limited number of logical registers, (ii) a function- or region-bounded optimization scope, and (iii) the requirement that transformations be valid along all possible paths. RENO is a modified MIPS-R10000 style register renaming mechanism augmented with physical register reference counting that uses map-table “short-circuiting” to implement dynamic versions of several well-known static optimizations: move elimination, common subexpression elimination, register allocation, and constant folding. Because it implements these optimizations dynamically, RENO can overcome some of the limitations faced by static compilers and apply optimizations where static compilers cannot. RENO has many more registers at its disposal: the entire physical register file. Its optimizations naturally cross function or any other compilation region boundary. And RENO performs optimizations along the dynamic path without being impacted by other, non-taken paths. If the dynamic path proves incorrect due to misspeculations, RENO optimizations are naturally rolled back along with the code they optimize. RENO unifies several previously proposed optimizations: dynamic move elimination [14] (RENOME), register integration [24] (RENOCSE), and speculative memory bypassing (the dynamic counterpart of register allocation) [14, 21, 22, 24] (RENORA). To this union, we add a new optimization: RENOCF, a dynamic version of constant folding. RENOCF extends the
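Of the optimizations this abstract lists, move elimination through map-table short-circuiting with reference counting is the easiest to sketch. The toy model below is an assumption-laden illustration, not RENO itself: the `Renamer` class and its methods are hypothetical, and a physical register is released as soon as its last mapping is overwritten, whereas a real pipeline would wait for the overwriting instruction to commit.

```python
class Renamer:
    """Toy register renamer with reference-counted physical registers."""

    def __init__(self, logical_regs, num_physical):
        # Initially each logical register maps to its own physical register.
        self.map = {r: i for i, r in enumerate(logical_regs)}
        self.refcount = [0] * num_physical
        for p in self.map.values():
            self.refcount[p] = 1
        self.free = list(range(len(logical_regs), num_physical))

    def _release(self, p):
        # Drop one reference; recycle the register when none remain.
        self.refcount[p] -= 1
        if self.refcount[p] == 0:
            self.free.append(p)

    def rename_move(self, dest, src):
        """Move elimination: point dest at src's physical register."""
        p = self.map[src]
        self.refcount[p] += 1          # two logical names now share p
        self._release(self.map[dest])
        self.map[dest] = p
        return p                        # no new register, nothing to execute

    def rename_op(self, dest):
        """Ordinary instruction: allocate a fresh physical register."""
        p = self.free.pop(0)
        self.refcount[p] = 1
        self._release(self.map[dest])
        self.map[dest] = p
        return p
```

The reference count is what makes sharing safe: after `rename_move("r2", "r1")`, overwriting `r1` must not recycle the shared physical register while `r2` still maps to it.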
Hardware atomicity for reliable software speculation
- In ISCA ’07: Proceedings of the 34th annual international symposium on Computer architecture
, 2007
Cited by 18 (2 self)
Speculative compiler optimizations are effective in improving both single-thread performance and reducing power consumption, but their implementation introduces significant complexity, which can limit their adoption, limit their optimization scope, and negatively impact the reliability of the compilers that implement them. To eliminate much of this complexity, as well as increase the effectiveness of these optimizations, we propose that microprocessors provide architecturally-visible hardware primitives for atomic execution. These primitives provide to the compiler the ability to optimize the program’s hot path in isolation, allowing the use of nonspeculative formulations of optimization passes to perform speculative optimizations. Atomic execution guarantees that if a speculation invariant does not hold, the speculative updates are discarded, the register state is restored, and control is transferred to a nonspeculative
Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization
- In Proceedings of the International Symposium on Microarchitecture
, 2004
Cited by 18 (0 self)
Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing the subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While new instructions can be designed automatically, there is a substantial amount of engineering cost incurred to verify and to implement the final custom processor. In this work, we propose a strategy for transparent customization of the core computation capabilities of the processor without changing its instruction set. A configurable array of function units is added to the baseline processor that enables the acceleration of a wide range of dataflow subgraphs. To exploit the array, the microarchitecture performs subgraph identification at run-time, replacing them with new microcode instructions to configure and utilize the array. We compare the effectiveness of replacing subgraphs in the fill unit of a trace cache versus using a translation table during decode, and evaluate the tradeoffs between static and dynamic identification of subgraphs for instruction set customization.
Dataflow mini-graphs: Amplifying superscalar capacity and bandwidth
- In Proc. of the 37th Annual International Symposium on Microarchitecture
, 2004
Cited by 16 (3 self)
A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor’s stages and the capacities of many of its structures without custom latency-reduction hardware. Amplification is achieved because the processor deals with a complete mini-graph via a single quasi-instruction, the handle. By constraining mini-graph structure and forcing handles to behave as much like singleton instructions as possible, the number and scope of the modifications over a conventional superscalar microarchitecture is kept to a minimum. This paper describes mini-graphs, a simple algorithm for extracting them from basic block frequency profiles, and a microarchitecture for exploiting them. Cycle-level simulation of several benchmark suites shows that mini-graphs can provide average performance gains of 2–12% over an aggressive baseline, with peak gains exceeding 40%. Alternatively, they can compensate for substantial reductions in register file and scheduler size, and in pipeline bandwidth.
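The interface constraints stated in this abstract (at most two register inputs, one register output, one memory operation, and one terminal control transfer) can be written as a legality check on a candidate instruction group. The sketch below is only an illustration under stated assumptions: the `Instr` record and `is_mini_graph` function are hypothetical, the group is assumed to come from a single basic block, and the caller is assumed to supply the set of registers that are live after the group.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    dest: Optional[str]          # register written, or None (e.g. a branch)
    srcs: Tuple[str, ...]        # registers read
    is_mem: bool = False         # load or store
    is_control: bool = False     # branch or jump

def is_mini_graph(group, live_out) -> bool:
    """Check the singleton-instruction interface for a candidate group."""
    defined, inputs = set(), set()
    for ins in group:
        # A source is an external input if no earlier group member wrote it.
        inputs |= {s for s in ins.srcs if s not in defined}
        if ins.dest is not None:
            defined.add(ins.dest)
    # Only values still needed after the group count as outputs.
    outputs = defined & set(live_out)
    mem_ops = sum(ins.is_mem for ins in group)
    ctrl = [i for i, ins in enumerate(group) if ins.is_control]
    ctrl_ok = not ctrl or ctrl == [len(group) - 1]   # single and terminal
    return len(inputs) <= 2 and len(outputs) <= 1 and mem_ops <= 1 and ctrl_ok

candidate = [
    Instr("t1", ("a", "b")),
    Instr("t2", ("t1", "a")),
    Instr(None, ("t2",), is_control=True),
]
```

Here `is_mini_graph(candidate, live_out=set())` accepts the group, while marking both `t1` and `t2` as live afterwards rejects it because the group would then need two register outputs.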
Dynamic Trace Selection Using Performance Monitoring Hardware Sampling
- in Proceedings of the 1st International Symposium on Code Generation and Optimization
, 2003
Cited by 15 (1 self)
Optimizing programs at run-time provides opportunities to apply aggressive optimizations to programs based on information that was not available at compile time. At run time, programs can be adapted to better exploit architectural features, optimize the use of dynamic libraries, and simplify code based on run-time constants. Our profiling system provides a framework for collecting information required for performing run-time optimization. We sample the performance hardware registers available on an Itanium processor, and select a set of code that is likely to lead to important performance-events. We gather distribution information about the performance-events we wish to monitor, and test our traces by estimating the ability for dynamic patching of a program to execute run-time generated traces. Our results show that we are able to capture 58% of execution time across various SPEC2000 integer benchmarks using our profile and patching techniques on a relatively small number of frequently executed execution paths. Our profiling and detection system overhead increases execution time by only 2-4%.
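The idea of covering a large share of execution time with a small number of frequently executed paths can be sketched as a simple selection over sampled data. This is a generic illustration, not the paper's profiling system: the function `select_hot_regions` is hypothetical, each sample is assumed to be already attributed to a code region, and the sample data is synthetic.

```python
from collections import Counter

def select_hot_regions(samples, max_regions):
    """Pick the most frequently sampled regions and report their coverage.

    samples: one region label per hardware sample (synthetic data here).
    Returns (regions, fraction of all samples that fall in those regions).
    """
    if not samples:
        return [], 0.0
    counts = Counter(samples)
    hot = counts.most_common(max_regions)
    covered = sum(n for _, n in hot)
    return [region for region, _ in hot], covered / len(samples)

# Synthetic samples: region A dominates, C is rarely seen.
samples = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
regions, coverage = select_hot_regions(samples, max_regions=2)
```

With this synthetic input, two regions account for nine of the ten samples; a real system would also have to weigh the cost of patching each selected region.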

