A Programmable Co-processor for Profiling
- In Proceedings of the 7th International Symposium on High Performance Computer Architecture (HPCA-7)
, 2001
Cited by 42 (1 self)
Aggressive program optimization requires accurate profile information, but such accuracy requires many samples to be collected. We explore a novel profiling architecture that reduces the overhead of collecting each sample by including a programmable co-processor that analyzes a stream of profile samples generated by a microprocessor. From this stream of samples, the co-processor can detect correlations between instructions (e.g., memory dependence profiling) as well as those between different dynamic instances of the same instruction (e.g., value profiling). The profiler's programmable nature allows a broad range of data to be extracted, post-processed, and formatted, as well as provides the flexibility to tailor the profiling application to the program under test. Because the co-processor is specialized for profiling, it can execute profiling applications more efficiently than a general-purpose processor. The co-processor should not significantly impact the cost or performance of the ...
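The value-profiling idea mentioned in this abstract (tracking the values produced by different dynamic instances of the same instruction) can be illustrated with a small software sketch. This is not the paper's co-processor design: the `(pc, value)` sample format, the `ValueProfiler` class, and its method names are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


class ValueProfiler:
    """Hypothetical software stand-in for a profiling co-processor."""

    def __init__(self):
        # One histogram of observed result values per instruction address.
        self.histograms = defaultdict(Counter)

    def consume(self, samples):
        """Process a stream of (pc, value) samples (assumed format)."""
        for pc, value in samples:
            self.histograms[pc][value] += 1

    def invariance(self, pc):
        """Return (most common value, fraction of samples with that value)."""
        hist = self.histograms.get(pc)
        if not hist:
            return None
        value, count = hist.most_common(1)[0]
        return value, count / sum(hist.values())


profiler = ValueProfiler()
profiler.consume([(0x400, 7), (0x400, 7), (0x400, 3), (0x404, 1), (0x400, 7)])
top_value, fraction = profiler.invariance(0x400)
```

An instruction whose `fraction` is close to 1 almost always produces the same value, which is the kind of fact a value-specializing optimizer would look for.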
Design and Implementation of a Lightweight Dynamic Optimization System
- Journal of Instruction-Level Parallelism
, 2004
Cited by 36 (7 self)
Many opportunities exist to improve micro-architectural performance due to performance events that are difficult to optimize at static compile time. Cache misses and branch mis-prediction patterns may vary for different micro-architectures using different inputs.
Increasing the size of atomic instruction blocks using control flow assertions
- In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture
, 2000
Cited by 33 (12 self)
For a variety of reasons, branch-less regions of instructions are desirable for high-performance execution. In this paper, we propose a means for increasing the dynamic length of branch-less regions of instructions for the purposes of dynamic program optimization. We call these atomic regions frames and we construct them by replacing original branch instructions with assertions. Assertion instructions check if the original branching conditions still hold. If they hold, no action is taken. If they do not, then the entire region is undone. In this manner, an assertion has no explicit control flow. We demonstrate that using branch correlation to decide when a branch should be converted into an assertion results in atomic regions that average over 100 instructions in length, with a probability of completion of 97%, and that constitute over 80% of the dynamic instruction stream. We demonstrate both static and dynamic means for constructing frames. When frames are built dynamically using finite-sized hardware, they average 80 instructions in length and have good caching properties.
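The commit-or-undo behavior of a frame described in this abstract can be sketched in software. The following is only an illustration under stated assumptions, not the paper's hardware mechanism: the register dictionary, `run_frame`, `add`, `assert_nonzero`, and `AssertionFailed` are all hypothetical names, and the checkpoint is modeled as a simple copy of the register state.

```python
class AssertionFailed(Exception):
    """Raised when an assertion's original branch condition no longer holds."""


def run_frame(state, frame):
    """Run a frame atomically: commit all of its updates or none of them.

    `frame` is a list of callables that mutate a working copy of `state`
    (a dict of register values). An op raises AssertionFailed to signal
    that the branch direction assumed when the frame was built is wrong.
    """
    working = dict(state)        # checkpoint: the original state stays untouched
    try:
        for op in frame:
            op(working)
    except AssertionFailed:
        return state, False      # undo the entire region
    return working, True         # commit every update at once


def add(dst, src, imm):
    def op(regs):
        regs[dst] = regs[src] + imm
    return op


def assert_nonzero(reg):
    # Stands in for a branch "if reg == 0 goto exit" that was rarely taken.
    def op(regs):
        if regs[reg] == 0:
            raise AssertionFailed(reg)
    return op


frame = [add("r1", "r1", 1), assert_nonzero("r2"), add("r3", "r1", 10)]
committed, ok = run_frame({"r1": 0, "r2": 5, "r3": 0}, frame)
rolled_back, ok_after_failure = run_frame({"r1": 0, "r2": 0, "r3": 0}, frame)
```

In the second call the assertion fires after `r1` has already been updated in the working copy, yet the returned state is the untouched original, which is what makes the region behave as if it had no internal control flow.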
An Architectural Framework for Run-Time Optimization
- IEEE Transactions on Computers
, 2001
Cited by 30 (2 self)
Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Run-time optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying run-time optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components do not add delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for run-time optimization because the hot execution regions are extracted, optimized, and written to main memory for execution and because these regions persist across context switches. The current design of the framework supports a suite of optimizations including partial function inlining (even into shared libraries), code straightening optimizations, loop unrolling, and peephole optimizations.
Performance Characterization of a Hardware Mechanism for Dynamic Optimization
- In 34th International Symposium on Microarchitecture
, 2001
Cited by 23 (3 self)
We evaluate the rePLay microarchitecture as a means for reducing application execution time by facilitating dynamic optimization. The framework contains a programmable optimization engine coupled with a hardware-based recovery mechanism. The optimization engine enables the dynamic optimizer to run concurrently with program execution. The recovery mechanism enables the optimizer to make speculative optimizations without requiring recovery code.
LLVA: A Low-level Virtual Instruction Set Architecture
- In MICRO-36
, 2003
Cited by 20 (3 self)
A virtual instruction set architecture (V-ISA) implemented via a processor-specific software translation layer can provide great flexibility to processor designers. Recent examples such as Crusoe and DAISY, however, have used existing hardware instruction sets as virtual ISAs, which complicates translation and optimization. In fact, there has been little research on specific designs for a virtual ISA for processors. This paper proposes a novel virtual ISA (LLVA) and a translation strategy for implementing it on arbitrary hardware. The instruction set is typed, uses an infinite virtual register set in Static Single Assignment form, and provides explicit control-flow and dataflow information, and yet uses low-level operations closely matched to traditional hardware. It includes novel mechanisms to allow more flexible optimization of native code, including a flexible exception model and minor constraints on self-modifying code. We propose a translation strategy that enables offline translation and transparent offline caching of native code and profile information, while remaining completely OS-independent. It also supports optimizations directly on the representation at install-time, runtime, and offline between executions. We show experimentally that the virtual ISA is compact, it is closely matched to ordinary hardware instruction sets, and permits very fast code generation, yet has enough high-level information to permit sophisticated program analyses and optimizations.
Master/Slave Speculative Parallelization and Approximate Code
, 2002
Cited by 13 (5 self)
This dissertation describes Master/Slave Speculative Parallelization (MSSP), a novel execution paradigm to improve the execution rate of sequential programs by parallelizing them speculatively for execution on a multiprocessor. In MSSP, one processor—the master—executes an approximate copy of the program to compute values the program’s execution is expected to compute. The master’s results are then checked by the slave processors by comparing them to the results computed by the original program. This validation is parallelized by cutting the program’s execution into tasks. Each slave uses its predicted inputs (as computed by the master) to validate the input predictions of the next task, inductively validating the whole execution. Approximate code, because it has no correctness requirements—in essence it is a software value predictor—can be optimized more effectively than traditionally generated code. It is free to sacrifice correctness in the uncommon case in order to maximize performance in the common case. In addition to introducing the notion of approximate code, this dissertation describes a prototype implementation of a program distiller that uses profile information to automatically generate approximate code. The distiller first applies unsafe transformations to remove uncommon case behaviors that are preventing optimization;
Continuous Adaptive Object-Code Reoptimization Framework
- Ninth Asia-Pacific Computer Systems Architecture Conference
, 2004
Cited by 12 (2 self)
Dynamic optimization presents opportunities for finding run-time bottlenecks and deploying optimizations in statically compiled programs. In this paper, we discuss the current implementation of our hardware-sampling-based dynamic optimization framework and apply it to various SPEC2000 benchmarks compiled with the ORC compiler at optimization level O2 and executed on an Itanium-2 machine. We use our optimization system to apply memory prefetching optimizations, improving the performance of multiple benchmark programs.
Continuous optimization
- In Proc. of the 32nd Annual International Symposium on Computer Architecture
, 2004
Cited by 10 (2 self)
This paper presents a hardware-based dynamic optimizer that continuously optimizes an application’s instruction stream. In continuous optimization, dataflow optimizations are performed using simple, table-based hardware placed in the rename stage of the processor pipeline. The continuous optimizer reduces dataflow height by performing constant propagation, reassociation, redundant load elimination, store forwarding, and silent store removal. To enhance the impact of the optimizations, the optimizer integrates values generated by the execution units back into the optimization process. Continuous optimization allows instructions with input values known at optimization time to be executed in the optimizer, leaving less work for the out-of-order portion of the pipeline. Continuous optimization can detect branch mispredictions earlier and thus reduce the misprediction penalty. In this paper, we present a detailed description of a hardware optimizer and evaluate it in the context of a contemporary microarchitecture running current workloads. Our analysis of SPECint, SPECfp, and MediaBench workloads reveals that a hardware optimizer can directly execute 33% of instructions, resolve 29% of mispredicted branches, and generate addresses for 76% of memory operations. These positive effects combine to provide speedups in the range 0.99 to 1.27.
Continuous Path and Edge Profiling
- In IEEE/ACM International Symposium on Microarchitecture
, 2005
Cited by 10 (2 self)
Microarchitectures increasingly rely on dynamic optimization to improve performance in ways that are difficult or impossible for ahead-of-time compilers. Dynamic optimizers in turn require continuous, portable, low cost, and accurate control-flow profiles to inform their decisions, but prior approaches have struggled to meet these goals simultaneously. This paper presents PEP, a

