Results 1 - 10
of
105
Dynamo: A Transparent Dynamic Optimization System
- ACM SIGPLAN Notices
, 2000
"... We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT ..."
Abstract
-
Cited by 347 (1 self)
- Add to MetaCart
We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated Dynamo, and often by a significant degree. For example, the average performance of --O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their --O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
Memory System Characterization of Commercial Workloads
- In Proceedings of the 25th annual international symposium on Computer architecture
, 1998
"... Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as ..."
Abstract
-
Cited by 203 (5 self)
- Add to MetaCart
Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers. This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches. 1
Architectural Support for Copy and Tamper Resistant Software
, 2000
"... Implementing copy protection on software is a difficult problem that has resisted a satisfactory solution for many years. This paper proposes a set of features that allows a machine to execute XOM code: code where neither the instructions or the data are visible to entities outside the running proce ..."
Abstract
-
Cited by 180 (5 self)
- Add to MetaCart
Implementing copy protection on software is a difficult problem that has resisted a satisfactory solution for many years. This paper proposes a set of features that allows a machine to execute XOM code: code where neither the instructions or the data are visible to entities outside the running process. To support XOM code we use a machine that supports internal compartments, where a process in one compartment cannot read data from another compartment. All data that leaves the machine is encrypted, since we assume secure compartments cannot be guaranteed by anything outside the machine. The design of this machine poses some interesting trade-offs between security, efficiency and flexibility. We explore some of the potential security issues as one pushes the machine to become more efficient and flexible. Our analysis indicates, while not cheap, it is possible to create a normal multi-tasking machine where nearly all applications can be run in XOM mode. While a virtual XOM machine is possible, the underlying hardware needs to support a unique private key, asymmetric decryption, private memory, fast symmetric ciphers, and traps on cache misses for efficient operation.
Piranha: A scalable architecture based on single-chip multiprocessing
- SIGARCH Comput. Archit. News
, 2000
"... The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instructionlevel parallelism. Meanwhile, such designs are especially ill suited for important commercial application ..."
Abstract
-
Cited by 175 (5 self)
- Add to MetaCart
The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instructionlevel parallelism. Meanwhile, such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identify chip multiprocessing (CMP) as potentially the most promising approach for designing processors
Using the SimOS Machine Simulator to Study Complex Computer Systems
- ACM TRANSACTIONS ON MODELING AND COMPUTER SIMULATION
, 1997
"... ... This paper identifies two challenges that machine simulators such as SimOS must overcome in order to effectively analyze large complex workloads: handling long workload execution times and collecting data effectively. To study long-running workloads, SimOS includes multiple interchangeable simul ..."
Abstract
-
Cited by 144 (5 self)
- Add to MetaCart
... This paper identifies two challenges that machine simulators such as SimOS must overcome in order to effectively analyze large complex workloads: handling long workload execution times and collecting data effectively. To study long-running workloads, SimOS includes multiple interchangeable simulation models for each hardware component. By selecting the appropriate combination of simulation models, the user can explicitly control the tradeoff between simulation speed and simulation detail. To handle the large amount of low-level data generated by the hardware simulation models, SimOS contains flexible annotation and event classification mechanisms that map the data back to concepts meaningful to the user. SimOS has been extensively used to study new computer hardware designs, to analyze application performance, and to study operating systems. We include two case studies that demonstrate how a low-level machine simulator such as SimOS can be used to study large and complex workloads.
Trace-Driven Memory Simulation: A Survey
- ACM Computing Surveys
, 2004
"... This article surveys and analyzes these developments by establishing criteria for evaluating trace-driven methods, and then applies these criteria to describe, categorize, and compare over 50 trace-driven simulation tools. We discuss the strengths and weaknesses of different approaches and show t ..."
Abstract
-
Cited by 134 (0 self)
- Add to MetaCart
This article surveys and analyzes these developments by establishing criteria for evaluating trace-driven methods, and then applies these criteria to describe, categorize, and compare over 50 trace-driven simulation tools. We discuss the strengths and weaknesses of different approaches and show that no single method is best when all criteria, including accuracy, speed, memory, flexibility, portability, expense, and ease of use are considered. In a concluding section, we examine fundamental limitations to trace-driven simulation, and survey some recent developments in memory simulation that may overcome these bottlenecks
An Infrastructure for Adaptive Dynamic Optimization
, 2003
"... Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework ..."
Abstract
-
Cited by 130 (5 self)
- Add to MetaCart
Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight, API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc.. To demonstrate
Performance Analysis Using the MIPS R10000 Performance Counters
, 1996
"... : Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events -- support that is particularly useful for c ..."
Abstract
-
Cited by 91 (0 self)
- Add to MetaCart
: Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events -- support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence, and speculative execution. We first explain how performance data is collected using an integrated set of hardware mechanisms, operating system abstractions, and performance tools. We then describe several examples drawn from scientific applications, which illustrate how the counters and profiling tools provide information that helps developers analyze and tune applications. Keywords: performance analysis, profiling tools, hardware performance counters, MIPS R10000, SGI Power Challenge 1. Introduction A fundamental question asked by HPC application developers is: "Where is the time spent?"...
SimICS/sun4m: A virtual workstation
- IN PROCEEDINGS OF THE USENIX ANNUAL TECHNICAL CONFERENCE
, 1998
"... System level simulators allow computer architects and system software designers to recreate an accurate and complete replica of the program behavior of a target system, regardless of the availability, existence, or instrumentation support of such a system. Applications include evaluation of architec ..."
Abstract
-
Cited by 74 (14 self)
- Add to MetaCart
System level simulators allow computer architects and system software designers to recreate an accurate and complete replica of the program behavior of a target system, regardless of the availability, existence, or instrumentation support of such a system. Applications include evaluation of architectural design alternatives as well as software engineering tasks such as traditional debugging and performance tuning. We present an implementation of a simulator acting as a virtual workstation fully compatible with the sun4m architecture from Sun Microsystems. Built using the system-level SPARC V8 simulator SimICS, SimICS/sun4m models one or more SPARC V8 processors, supports user-developed modules for data cache
ATEMU: A fine-grained sensor network simulator
- IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks
, 2004
"... Abstract—In this paper we describe the design and implementation of ATEMU, a fine grained sensor network simulator. ATEMU is intended to bridge the gap between actual sensor network deployments and sensor network simulations. We adopt a hybrid strategy, where the operation of individual sensor nodes ..."
Abstract
-
Cited by 58 (0 self)
- Add to MetaCart
Abstract—In this paper we describe the design and implementation of ATEMU, a fine grained sensor network simulator. ATEMU is intended to bridge the gap between actual sensor network deployments and sensor network simulations. We adopt a hybrid strategy, where the operation of individual sensor nodes is emulated in an instruction by instruction manner, and their interactions with each other via wireless transmissions are simulated in a realistic manner. A unique feature of ATEMU is its ability to simulate a heterogeneous sensor network. Using ATEMU it is possible to not only accurately simulate the operation of different application on the MICA2 platform but also a complete sensor network where the sensor nodes themselves maybe based on different hardware platforms. In addition we also describe our implementation of XATDB, our front-end debugger/GUI for ATEMU. XATDB provides an excellent educational tool for people to start learning about the operation of sensor nodes and sensor networks, without requiring the purchase of actual sensor node hardware. The accuracy and emulation capabilities provided by ATEMU ensure that when and if actual hardware is used, the software will already have undergone rigorous testing and debugging on an accurate platform. This would provide the sensor network deployment community with a much more accurate estimate of the performance of various algorithms and protocols in realistic scenarios and platforms. I.

