Efficient, Transparent and Comprehensive Runtime Code Manipulation
, 2004
Cited by 28 (1 self)
This thesis addresses the challenges of building a software system for general-purpose runtime code manipulation. Modern applications, with dynamically-loaded modules and dynamically-generated code, are assembled at runtime. While it was once feasible at compile time to observe and manipulate every instruction (which is critical for program analysis, instrumentation, trace gathering, optimization, and similar tools), it can now only be done at runtime. Existing runtime tools are successful at inserting instrumentation calls, but no general framework has been developed for fine-grained and comprehensive code observation and modification without high overheads. This thesis demonstrates the feasibility of building such a system in software. We present DynamoRIO, a fully-implemented runtime code manipulation system that supports code transformations on any part of a program, while it executes. DynamoRIO uses code caching technology to provide efficient, transparent, and comprehensive manipulation of an unmodified application running on a stock operating system and commodity hardware. DynamoRIO executes large, complex, modern applications with dynamically-loaded, generated, or even modified code. Despite the
Using dynamic binary translation to fuse dependent instructions
- In CGO ’04: Proceedings of the International Symposium on Code Generation and Optimization
, 2004
Cited by 17 (3 self)
Instruction scheduling hardware can be simplified and easily pipelined if pairs of dependent instructions are fused so they share a single instruction scheduling slot. We study an implementation of the x86 ISA that dynamically translates x86 code to an underlying ISA that supports instruction fusing. A microarchitecture that is co-designed with the fused instruction set completes the implementation. In this paper, we focus on the dynamic binary translator for such a co-designed x86 virtual machine. The dynamic binary translator first cracks x86 instructions belonging to hot superblocks into RISC-style micro-operations, and then uses heuristics to fuse together pairs of dependent micro-operations. Experimental results with SPEC2000 integer benchmarks demonstrate that: (1) the fused ISA with dynamic binary translation reduces the number of scheduling decisions by about 30% versus a conventional implementation that uses hardware cracking into RISC micro-operations; (2) an instruction scheduling slot needs to hold only two source register fields even though it may hold two instructions; (3) translations generated in the proposed ISA consume about 30% less storage than a corresponding fixed-length RISC-style ISA.
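The pairing step described in this abstract can be illustrated with a minimal sketch. It is not the paper's heuristic: the `MicroOp` type, the `fuse_pairs` function, and the rule of fusing only adjacent producer/consumer pairs are assumptions made for illustration. The one constraint taken from the abstract is that a fused slot carries at most two source register fields.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical RISC-style micro-operation (illustrative only)."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def fuse_pairs(ops: List[MicroOp]) -> List[Tuple[MicroOp, ...]]:
    """Greedily fuse an op with the next op when the next op reads its result.

    A pair is fused only if the pair as a whole needs at most two
    external source registers, so it fits one scheduling slot.
    """
    slots: List[Tuple[MicroOp, ...]] = []
    i = 0
    while i < len(ops):
        head = ops[i]
        if i + 1 < len(ops):
            tail = ops[i + 1]
            if head.dest is not None and head.dest in tail.srcs:
                # The head's result is forwarded inside the pair, so it
                # does not count as an external source of the slot.
                external = set(head.srcs) | (set(tail.srcs) - {head.dest})
                if len(external) <= 2:
                    slots.append((head, tail))
                    i += 2
                    continue
        slots.append((head,))
        i += 1
    return slots


ops = [
    MicroOp("load", "t0", ("r1",)),
    MicroOp("add", "r2", ("t0", "r3")),
    MicroOp("sub", "r4", ("r5", "r6")),
    MicroOp("mul", "r7", ("r4", "r8")),
]
slots = fuse_pairs(ops)
```

In this example the `load`/`add` pair is fused because it needs only `r1` and `r3` from outside, while `sub`/`mul` stays unfused because the pair would need three external sources.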
SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance
Cited by 11 (1 self)
Dynamic instrumentation systems have proven to be extremely valuable for program introspection, architectural simulation, and bug detection. Yet a major drawback of modern instrumentation systems is that the instrumented applications often execute several orders of magnitude slower than native application performance. In this paper, we present a novel approach to dynamic instrumentation where several non-overlapping slices of an application are launched as separate instrumentation threads and executed in parallel in order to approach real-time performance. A direct implementation of our technique in the Pin dynamic instrumentation system results in dramatic speedups for various instrumentation tasks, often resulting in order-of-magnitude performance improvements. Our implementation is available as part of the Pin distribution, which has been downloaded over 10,000 times since its release.
NANA: A Nano-scale Active Network Architecture
- ACM Journal on Emerging Technologies in Computing Systems
, 2006
Cited by 8 (6 self)
This paper explores the architectural challenges introduced by emerging bottom-up fabrication of nanoelectronic circuits. The specific nanotechnology we explore proposes patterned DNA nanostructures as a scaffold for the placement and interconnection of carbon nanotube or silicon nanorod FETs to create a limited-size circuit (node). Three characteristics of this technology that significantly impact architecture are 1) limited node size, 2) random node interconnection, and 3) high defect rates. We present and evaluate an accumulator-based active network architecture that is compatible with any technology that presents these three challenges. This architecture represents an initial, unoptimized solution for understanding the implications of DNA-guided self-assembly.
A Dependency Chain Clustered Microarchitecture
- In International Parallel and Distributed Processing Symposium
, 2005
Cited by 5 (0 self)
In this paper we explore a new clustering approach for reducing the complexity of wide-issue in-order processors based on EPIC architectures. Complexity effectiveness is achieved by heavily clustering the pipeline from decode to commit stage without the need for any direct bypass between clusters. This is made possible by assuming support for executing compiler-constructed traces. One trace is executed at a time by executing its coarse-grained dependency chains (DCs) in different in-order clusters. Since the DCs of a trace are mutually data independent, they can be executed in different clusters without any direct communication between them. To execute DCs in narrower clusters without compromising ILP, a compiler algorithm that splits large DCs by duplicating instructions is proposed.
Static strands: Safely exposing dependence chains for increasing embedded power efficiency
- In Proc. 2005 Conference on Languages, Compilers, and Tools for Embedded Systems
, 2005
Cited by 4 (0 self)
Modern embedded processors are designed to maximize execution efficiency: the amount of performance achieved per unit of energy dissipated while meeting minimum performance levels. To increase this efficiency, we propose utilizing static strands, dependence chains without fan-out, which are exposed by a compiler pass. These dependent instructions are resequenced to be sequential and annotated to communicate their location to the hardware. Importantly, this modified application is binary compatible and functionally identical to the original, allowing transparent execution on a baseline processor. However, these static strands can be easily collapsed and optimized by simple processor modifications, significantly reducing the workload energy. Results show that over 30% of MediaBench and Spec2000int dynamic instructions can be collapsed, reducing issue logic energy by 20%, bypass energy by 19%, and register file energy by 14%. In addition, by increasing the effective capacity of pipeline resources by almost a third, average IPC can be improved by up to 15%. This performance gain can then be traded in for a lower clock frequency to maintain a baseline
Dynamic Software Trace Caching
Cited by 2 (0 self)
Caching basic blocks in the most frequent order greatly increases fetch bandwidth. Traditional compile-time code reordering requires a profile feedback step, which is an obstacle in itself, and is susceptible to run-time program behavior changes. On the other hand, hardware trace caches are limited both in capacity and trace construction window size. We propose a software-managed trace cache mechanism that improves instruction fetch performance by dynamic code straightening and provides dynamic binary translation/optimization opportunities based on runtime program behavior.
Code Cache Management in Dynamic Optimization Systems
, 2004
Cited by 2 (0 self)
Dynamic optimization systems store optimized or translated code in software-managed code caches in order to maximize reuse of transformed code. Code caches store superblocks that are not fixed in size, may contain links to other superblocks, and carry a high replacement overhead. These additional constraints reduce the effectiveness of conventional cache management policies. This dissertation investigates the code cache management problem in dynamic optimization systems and presents three major advances that cover the design space of cache management decisions. Through code cache simulations, we show that a FIFO replacement policy outperforms other traditional policies, as it enables contiguous cache evictions, allows for a simple circular buffer implementation, and results in comparable cache miss rates to LRU. Furthermore, a pseudo-circular FIFO algorithm is presented, which handles the problem of un-deletable cache blocks. An investigation of cache eviction granularities also reveals that evicting more than the minimum number of superblocks from the code cache at a time results in
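The FIFO policy this abstract describes can be sketched as follows. This is an illustrative model under stated assumptions, not the dissertation's implementation: the `FifoCodeCache` class and its `insert` method are hypothetical names, superblocks are reduced to an identifier and a byte size, and wrap-around fragmentation, inter-superblock links, and un-deletable blocks are ignored.

```python
from collections import deque
from typing import Deque, List, Tuple


class FifoCodeCache:
    """Toy code cache that evicts variable-size superblocks oldest-first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        # (block_id, size) pairs, oldest at the left end.
        self.blocks: Deque[Tuple[str, int]] = deque()

    def insert(self, block_id: str, size: int) -> List[str]:
        """Insert a superblock and return the ids evicted to make room."""
        if size > self.capacity:
            raise ValueError("superblock larger than the whole cache")
        evicted: List[str] = []
        # Evict in insertion order until the new superblock fits. In a
        # circular buffer these victims would be contiguous in memory.
        while self.used + size > self.capacity:
            old_id, old_size = self.blocks.popleft()
            self.used -= old_size
            evicted.append(old_id)
        self.blocks.append((block_id, size))
        self.used += size
        return evicted


cache = FifoCodeCache(capacity=100)
first = cache.insert("A", 40)
second = cache.insert("B", 40)
third = cache.insert("C", 50)   # must evict the oldest block, A
fourth = cache.insert("D", 60)  # must evict both B and C
```

Because superblocks vary in size, a single insertion may evict several older superblocks, as the last call shows.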
A co-designed virtual machine for instruction level distributed processing
Cited by 1 (0 self)
A current trend in high-performance superscalar processors is toward simpler designs that attempt to strike a balance between clock frequency, instruction-level parallelism, and power consumption. To achieve this goal, the thesis research reported here advocates a microarchitecture and design paradigm that rely less on low-level speculation techniques and more on simpler, modular designs with distributed processing at the instruction level, i.e., instruction-level distributed processing (ILDP). This thesis shows that designing a hardware/software co-designed virtual machine (VM) system using an accumulator-oriented instruction set architecture (ISA) and microarchitecture is a good approach for implementing complexity-effective, high-performance out-of-order superscalar machines. The following three key points support this conclusion: • An accumulator-oriented instruction format and microarchitecture fit today’s technology constraints better than conventional design approaches: The ILDP ISA format assigns temporary values that account for most of the register communication to a small number of accumulators. As a result, the complexity of the register file and associated hardware
Reducing startup time in co-designed virtual machines
- In Proc. of the 33rd Annual International Symposium on Computer Architecture
, 2006
Cited by 1 (0 self)
A Co-Designed Virtual Machine allows designers to implement a processor via a combination of hardware and software. Dynamic binary translation converts code written for a conventional (legacy) ISA into optimized code for an underlying implementation-specific ISA. Because translation is done dynamically, an important consideration in such systems is the startup time for performing the initial translations. Beginning with a previously proposed co-designed VM that implements the x86 ISA, we study runtime binary translation overhead effects. The co-designed x86 virtual machine is based on an adaptive translation system that uses a basic block translator for initial emulation and a superblock translator for hotspot optimization. We analyze and model VM startup performance via simulation. We observe that non-hotspot emulation via basic block translation is the major part of the startup overhead. To reduce startup translation overhead, we follow the co-designed hardware/software philosophy and propose hardware assists to dramatically accelerate basic block translations. By combining hardware assists with balanced translation strategies, the co-designed translation system reduces runtime overhead significantly and demonstrates very competitive startup performance when compared with conventional processors running a set of Windows application benchmarks.

