Results 1 - 10
of
38
Dynamo: A Transparent Dynamic Optimization System
- ACM SIGPLAN Notices
, 2000
"... We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT ..."
Abstract
-
Cited by 347 (1 self)
- Add to MetaCart
We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated Dynamo, and often by a significant degree. For example, the average performance of --O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their --O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
Embra: Fast and Flexible Machine Simulation
- In Measurement and Modeling of Computer Systems
, 1996
"... This paper describes Embra, a simulator for the processors, caches, and memory systems of uniprocessors and cache-coherent multiprocessors. When running as part of the SimOS simulation environment, Embra models the processors of a MIPS R3000/R4000 machine faithfully enough to run a commercial operat ..."
Abstract
-
Cited by 146 (3 self)
- Add to MetaCart
This paper describes Embra, a simulator for the processors, caches, and memory systems of uniprocessors and cache-coherent multiprocessors. When running as part of the SimOS simulation environment, Embra models the processors of a MIPS R3000/R4000 machine faithfully enough to run a commercial operating system and arbitrary user applications. To achieve high simulation speed, Embra uses dynamic binary translation to generate code sequences which simulate the workload. It is the first machine simulator to use this technique. Embra can simulate real workloads such as multiprocess compiles and the SPEC92 benchmarks running on Silicon Graphic's IRIX 5.3 at speeds only 3 to 9 times slower than native execution of the workload, making Embra the fastest reported complete machine simulator. Dynamic binary translation also gives Embra the flexibility to dynamically control both the simulation statistics reported and the simulation model accuracy with low performance overheads. For example, Embra...
Using the SimOS Machine Simulator to Study Complex Computer Systems
- ACM TRANSACTIONS ON MODELING AND COMPUTER SIMULATION
, 1997
"... ... This paper identifies two challenges that machine simulators such as SimOS must overcome in order to effectively analyze large complex workloads: handling long workload execution times and collecting data effectively. To study long-running workloads, SimOS includes multiple interchangeable simul ..."
Abstract
-
Cited by 144 (5 self)
- Add to MetaCart
... This paper identifies two challenges that machine simulators such as SimOS must overcome in order to effectively analyze large complex workloads: handling long workload execution times and collecting data effectively. To study long-running workloads, SimOS includes multiple interchangeable simulation models for each hardware component. By selecting the appropriate combination of simulation models, the user can explicitly control the tradeoff between simulation speed and simulation detail. To handle the large amount of low-level data generated by the hardware simulation models, SimOS contains flexible annotation and event classification mechanisms that map the data back to concepts meaningful to the user. SimOS has been extensively used to study new computer hardware designs, to analyze application performance, and to study operating systems. We include two case studies that demonstrate how a low-level machine simulator such as SimOS can be used to study large and complex workloads.
Software profiling for hot path prediction: less is more
- SIGPLAN Not
"... Recently, there has been a growing interest in exploiting profile information in adaptive systems such as just-in-time compilers, dynamic optimizers and, binary translators. In this paper, we show that sophisticated software profiling schemes that provide highly accurate information in an offline se ..."
Abstract
-
Cited by 61 (0 self)
- Add to MetaCart
Recently, there has been a growing interest in exploiting profile information in adaptive systems such as just-in-time compilers, dynamic optimizers and, binary translators. In this paper, we show that sophisticated software profiling schemes that provide highly accurate information in an offline setting are ill-suited for these dynamic code generation systems. We experimentally demonstrate that hot path predictions must be made early in order to control the rising cost of missed opportunity that result from the prediction delay. We also show that existing sophisticated path profiling schemes, if used in an online setting, offer no prediction advantages over simpler schemes that exhibit much lower runtime overheads. Based on these observation we developed a new low-overhead software profiling scheme for hot path prediction. Using an abstract metric we compare our scheme to path profile based prediction and show that our scheme achieves comparable prediction quality. In our second set of experiments we include runtime overhead and evaluate the performance of our scheme in a realistic application: Dynamo, a dynamic optimization system. The results show that our prediction scheme clearly outperforms path profile based prediction and thus confirm that less profiling as exhibited in our scheme will actually lead to more effective hot path prediction. 1.
DELI: A New Run-Time Control Point
- In 35th International Symposium on Microarchitecture
, 2002
"... The Dynamic Execution Layer Interface (DELI) offers the following unique capability: it provides fine-grain control over the execution of programs, by allowing its clients to observe and optionally manipulate every single instruction ---at run time---just before it runs. DELI accomplishes this by op ..."
Abstract
-
Cited by 53 (1 self)
- Add to MetaCart
The Dynamic Execution Layer Interface (DELI) offers the following unique capability: it provides fine-grain control over the execution of programs, by allowing its clients to observe and optionally manipulate every single instruction ---at run time---just before it runs. DELI accomplishes this by opening up an interface to the layer between the execution of software and hardware. To avoid the slowdown, DELI caches a private copy of the executed code and always runs out of its own private cache.
Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture
- In Supercomputing
, 1999
"... Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PI ..."
Abstract
-
Cited by 53 (12 self)
- Add to MetaCart
Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the mem...
Transparent dynamic optimization: The design and implementation of Dynamo
, 1999
"... dynamic optimization, compiler, trace selection, binary translation © Copyright Hewlett-Packard Company 1999 Dynamic optimization refers to the runtime optimization of a native program binary. This report describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capabl ..."
Abstract
-
Cited by 49 (4 self)
- Add to MetaCart
dynamic optimization, compiler, trace selection, binary translation © Copyright Hewlett-Packard Company 1999 Dynamic optimization refers to the runtime optimization of a native program binary. This report describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language,
Distributed Prefetch-buffer/Cache Design for High Performance Memory Systems
"... Microprocessor execution speeds are improving at a rate of 50%-80% per year while DRAM access times are improving at a much lower rate of 5%-10% per year. Computer systems are rapidly approaching the point at which overall system performance is determined not by the speed of the CPU but by the memor ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
Microprocessor execution speeds are improving at a rate of 50%-80% per year while DRAM access times are improving at a much lower rate of 5%-10% per year. Computer systems are rapidly approaching the point at which overall system performance is determined not by the speed of the CPU but by the memory system speed. We present a high performance memory system architecture that overcomes the growing speed disparity between high performance microprocessors and current generation DRAMs. A novel prediction and prefetching technique is combined with a distributed cache architecture to build a high performance memory system. We use a table based prediction scheme with a prediction cache to prefetch data from the on-chip DRAM array to an on-chip SRAM prefetch buffer. By prefetching data we are able to hide the large latency associated with DRAM access and cycle times. Our experiments show that with a small (32 KB) prediction cache we can get an effective main memory access time that is close to the access time of larger secondary caches.
Efficient Performance Prediction for Modern Microprocessors
, 1999
"... Performance estimation of computer systems is an important topic to a large number of people in the computer industry. Computer architects need to be able to study future machines, compiler writers need to be able to evaluate the compiler output before a machine exists, and developers need insight i ..."
Abstract
-
Cited by 23 (1 self)
- Add to MetaCart
Performance estimation of computer systems is an important topic to a large number of people in the computer industry. Computer architects need to be able to study future machines, compiler writers need to be able to evaluate the compiler output before a machine exists, and developers need insight into the machine's performance in order to tune their code. There are many performance estimation techniques that range from profile -based approaches to full machine simulation. Detailed simulation is one of the most common methods for estimating performance. It suffers, however, from potentially long run times when simulating large applications using detailed processor models. This thesis
Global Context-Based Value Prediction
, 1999
"... Various methods for value prediction have been proposed to overcome the limits imposed by data dependencies within programs. Using a value prediction scheme, an instruction's computed value is predicted during the fetch stage and forwarded to all dependent instructions to speed up execution. Value p ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
Various methods for value prediction have been proposed to overcome the limits imposed by data dependencies within programs. Using a value prediction scheme, an instruction's computed value is predicted during the fetch stage and forwarded to all dependent instructions to speed up execution. Value prediction schemes have been based on a local context by predicting values using the values generated by the same instruction. This paper presents techniques that predict values of an instruction based on a global context where the behavior of other instructions is used in prediction. The global context includes the path along which an instruction is executed and the values computed by other previously completed instructions. We present techniques that augment conventional last value and stride predictors with global context information. Experiments performed using path-based techniques with realistic table sizes resulted in an increase in prediction of 6.4-8.4% over the current prediction sc...

