Performance Characterization of a Hardware Mechanism for Dynamic Optimization
In 34th International Symposium on Microarchitecture, 2001
Cited by 23 (3 self)
We evaluate the rePLay microarchitecture as a means for reducing application execution time by facilitating dynamic optimization. The framework contains a programmable optimization engine coupled with a hardware-based recovery mechanism. The optimization engine enables the dynamic optimizer to run concurrently with program execution. The recovery mechanism enables the optimizer to make speculative optimizations without requiring recovery code.
Filtering Techniques to Improve Trace-Cache Efficiency
In Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2001
Cited by 19 (4 self)
The trace cache is becoming an important building block of modern, wide-issue processors. So far, trace-cache research has focused on increasing fetch bandwidth. Trace caches have been shown to effectively increase the number of “useful” instructions that can be fetched into the machine, thus enabling more instructions to be executed each cycle. However, the trace cache has another important benefit that has received less attention in recent research: for a variable-length ISA in particular, such as Intel’s IA-32 architecture (x86), reducing instruction decoding power is especially attractive. Keeping the instruction traces in decoded format implies the decoding power is only paid when a trace is built, thus reducing the overall power consumption of the
Fetching instruction streams
In Procs. of the 36th Intl. Symposium on Microarchitecture, 2002
Cited by 16 (7 self)
Fetch performance is a very important factor because it effectively limits the overall processor performance. However, there is little performance advantage in increasing front-end performance beyond what the back-end can consume. For each processor design, the target is to build the best possible fetch engine for the required performance level. A fetch engine will be better if it provides better performance, but also if it takes fewer resources, requires less chip area, or consumes less power. In this paper we propose a novel fetch architecture based on the execution of long streams of sequential instructions, taking maximum advantage of code layout optimizations. We describe our architecture in detail, and show that it requires less complexity and resources than other high performance fetch architectures like the trace cache, while providing a high fetch performance suitable for wide-issue superscalar processors. Our results show that using our fetch architecture and code layout optimizations obtains 10% higher performance than the EV8 fetch architecture, and 4% higher than the FTB architecture using state-of-the-art branch predictors, while being only 1.5% slower than the trace cache. Even in the absence of code layout optimizations, fetching instruction streams is still 10% faster than the EV8, and only 4% slower than the trace cache. Fetching instruction streams effectively exploits the special characteristics of layout optimized codes to provide a high fetch performance, close to that of a trace cache, but has a much lower cost and complexity, similar to that of a basic block architecture.
Managing bounded code caches in dynamic binary optimization systems
ACM Trans. on Architecture and Code Optimization
Cited by 13 (4 self)
Dynamic binary optimizers store altered copies of original program instructions in software-managed code caches in order to maximize reuse of transformed code. Code caches store code blocks that may vary in size, reference other code blocks, and carry a high replacement overhead. These unique constraints reduce the effectiveness of conventional cache management policies. Our work directly addresses these unique constraints and presents several contributions to the code-cache management problem. First, we show that evicting more than the minimum number of code blocks from the code cache results in less run-time overhead than the existing alternatives. Such granular evictions reduce overall execution time, as the fixed costs of invoking the eviction mechanism are amortized across multiple cache insertions. Second, a study of the ideal lifetimes of dynamically generated code blocks illustrates the benefit of a replacement algorithm based on a generational heuristic. We describe and evaluate a generational approach to code cache management that makes it easy to identify long-lived code blocks and simultaneously avoid any fragmentation because of the eviction of short-lived blocks. Finally, we present results from an implementation of our generational approach in the DynamoRIO framework and illustrate that, as dynamic optimization systems become more prevalent, effective code-cache management policies will be essential for reliable, scalable performance of modern applications.
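The amortization argument in this abstract (evicting more than the minimum so that the fixed cost of each eviction call is spread over several insertions) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the class `FifoCodeCache`, the FIFO eviction order, the helper `count_eviction_calls`, and all sizes are hypothetical choices made for illustration.

```python
from collections import deque


class FifoCodeCache:
    """Hypothetical bounded code cache; FIFO order is an assumption of this sketch."""

    def __init__(self, capacity, granularity):
        self.capacity = capacity        # total bytes available in the cache
        self.granularity = granularity  # minimum bytes freed per eviction call
        self.blocks = deque()           # (block_id, size) pairs, oldest first
        self.used = 0                   # bytes currently occupied
        self.eviction_calls = 0         # times the eviction mechanism was invoked

    def insert(self, block_id, size):
        if size > self.capacity:
            raise ValueError("block is larger than the whole cache")
        free = self.capacity - self.used
        if size > free:
            # Free at least what the new block needs, and at least one
            # eviction unit, so one call can serve several later inserts.
            self._evict(max(size - free, self.granularity))
        self.blocks.append((block_id, size))
        self.used += size

    def _evict(self, bytes_needed):
        self.eviction_calls += 1
        freed = 0
        while freed < bytes_needed and self.blocks:
            _, size = self.blocks.popleft()  # drop the oldest block first
            freed += size
            self.used -= size


def count_eviction_calls(granularity, n_blocks=1000, block_size=64, capacity=4096):
    """Insert n_blocks equal-sized blocks and report how often eviction ran."""
    cache = FifoCodeCache(capacity, granularity)
    for block_id in range(n_blocks):
        cache.insert(block_id, block_size)
    return cache.eviction_calls
```

With these illustrative parameters, `count_eviction_calls(64)` evicts one block per call and invokes the mechanism 936 times, while `count_eviction_calls(1024)` frees sixteen blocks per call and invokes it 59 times. The sketch counts invocations only; it does not model the extra cache misses that coarser evictions may cause, which the paper weighs against this saving.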
Design Alternatives For Caching Long Regions Of The Dynamic Instruction Stream
2001
Cited by 4 (1 self)
Noncontiguous control flow challenges high-bandwidth execution in microprocessors by prematurely terminating a fetch to less than a full fetch width. To deal with this problem, methods have been devised ranging from branch prediction schemes to compiler techniques for reducing taken control flow to hardware mechanisms for caching dynamic traces from the instruction stream. Recently, a technique to form long instruction sequences called frames using branch promotion has been proposed. Frames are instruction entities that can grow to be very long and must be cached as atomic units.
Dynamic Software Trace Caching
Cited by 2 (0 self)
Caching basic blocks in the most frequent order greatly increases fetch bandwidth. Traditional compile-time code reordering requires a profile feedback step, which is an obstacle in itself, and is susceptible to run-time program behavior changes. On the other hand, hardware trace caches are limited both in capacity and trace construction window size. We propose a software-managed trace cache mechanism that improves instruction fetch performance by dynamic code straightening and provides dynamic binary translation/optimization opportunities based on runtime program behavior.
Code Cache Management in Dynamic Optimization Systems
2004
Cited by 2 (0 self)
Dynamic optimization systems store optimized or translated code in software-managed code caches in order to maximize reuse of transformed code. Code caches store superblocks that are not fixed in size, may contain links to other superblocks, and carry a high replacement overhead. These additional constraints reduce the effectiveness of conventional cache management policies. This dissertation investigates the code cache management problem in dynamic optimization systems and presents three major advances that cover the design space of cache management decisions. Through code cache simulations, we show that a FIFO replacement policy outperforms other traditional policies, as it enables contiguous cache evictions, allows for a simple circular buffer implementation, and results in comparable cache miss rates to LRU. Furthermore, a pseudo-circular FIFO algorithm is presented, which handles the problem of un-deletable cache blocks. An investigation of cache eviction granularities also reveals that evicting more than the minimum number of superblocks from the code cache at a time results in
A co-designed virtual machine for instruction level distributed processing
Cited by 1 (0 self)
A current trend in high-performance superscalar processors is toward simpler designs that attempt to strike a balance between clock frequency, instruction-level parallelism, and power consumption. To achieve this goal, the thesis research reported here advocates a microarchitecture and design paradigm that rely less on low-level speculation techniques and more on simpler, modular designs with distributed processing at the instruction level, i.e., instruction-level distributed processing (ILDP). This thesis shows that designing a hardware/software co-designed virtual machine (VM) system using an accumulator-oriented instruction set architecture (ISA) and microarchitecture is a good approach for implementing complexity-effective, high-performance out-of-order superscalar machines. The following three key points support this conclusion: • An accumulator-oriented instruction format and microarchitecture fit today’s technology constraints better than conventional design approaches: The ILDP ISA format assigns temporary values that account for most of the register communication to a small number of accumulators. As a result, the complexity of the register file and associated hardware
An analysis of a novel approach to dynamic optimization
2003
Cited by 1 (1 self)
In an attempt to achieve higher application performance, compiler researchers have developed a multitude of techniques, such as profiling, predication, and hyperblock and superblock scheduling. These techniques capitalize on the static behavior of applications to optimize and schedule instructions for high-performance execution. One aspect of performance that compiler techniques are relatively incapable of taking advantage of is the dynamic behavior of programs. Programs often contain phased or temporarily biased branches that yield increased performance opportunities. By examining individual dynamic paths created by these branches, many opportunities for optimization are revealed that were hidden during static compilation. In this thesis, an approach for discovering these paths at application runtime will be proposed. This mechanism can dynamically gather long traces of instructions that make up a highly biased path; we call these traces frames. The optimization framework guarantees execution of either the entire region or none of the region. This atomic property combined with the size of the instruction regions offers significant potential for improving application performance beyond that obtainable solely through static compilation. The contributions of this thesis are threefold: (1) an evaluation of the performance and effectiveness of classic compiler and frame-specific optimizations performed dynamically, (2) possible design options for implementation, and (3) a look at other avenues for increasing performance using the optimizer in the rePLay framework.
A Comparative Study of Redundancy in Trace Caches
Cited by 1 (1 self)
Trace cache performance is limited by two types of redundancy: duplication and liveness. In this paper, we show that duplication is not strongly correlated to trace cache performance. Generally, the best-performing trace caches also introduce the most duplication. The amount of dead traces is extremely high, ranging from 76% in the smallest trace cache to 35% in the largest trace cache studied. Furthermore, most of these dead traces are never used between storing them and replacing them from the trace cache.

