Results 1 - 10 of 310
Dynamo: A Transparent Dynamic Optimization System
- ACM SIGPLAN NOTICES, 2000
"... We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT ..."
Abstract - Cited by 479 (2 self)
We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
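As a concrete illustration of the hot-trace idea this abstract alludes to, below is a minimal C sketch of counter-based hot-target detection: an interpreter bumps a counter on each backward taken branch, and once a threshold is crossed the target is promoted to a cached, optimized trace that future executions dispatch into. The counter table, threshold value, and all names are illustrative assumptions, not Dynamo's actual implementation.

/* Minimal sketch of Dynamo-style hot-trace selection. The threshold
 * and table sizes are hypothetical, chosen only for illustration.     */
#include <stdio.h>
#include <stdbool.h>

#define N_TARGETS 1024          /* toy address space of branch targets */
#define HOT_THRESHOLD 50        /* hypothetical hotness threshold      */

static unsigned exec_count[N_TARGETS]; /* profile counters             */
static bool in_cache[N_TARGETS];       /* "fragment cache" membership  */

/* Called each time a backward taken branch lands on `target`. Returns
 * true once the target is backed by a cached trace, meaning future
 * executions bypass the interpreter.                                  */
static bool on_backward_branch(unsigned target)
{
    if (in_cache[target])
        return true;                       /* dispatch into the cache  */
    if (++exec_count[target] >= HOT_THRESHOLD) {
        /* A real system would now record the dynamic instruction
         * sequence, optimize it, and emit it into the fragment cache. */
        in_cache[target] = true;
        printf("target %u became hot: trace formed and cached\n", target);
    }
    return in_cache[target];
}

int main(void)
{
    /* Simulate a loop whose head (target 7) is reached 100 times.     */
    for (int i = 0; i < 100; i++)
        (void)on_backward_branch(7);
    printf("target 7 was interpreted %u times before caching\n",
           exec_count[7]);
    return 0;
}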
Secure Execution Via Program Shepherding
2002
"... We introduce program shepherding, a method for monitoring control flow transfers during program execution to enforce a security policy. Program shepherding provides three techniques as building blocks for security policies. First, shepherding can restrict execution privileges on the basis of code or ..."
Abstract - Cited by 308 (5 self)
We introduce program shepherding, a method for monitoring control flow transfers during program execution to enforce a security policy. Program shepherding provides three techniques as building blocks for security policies. First, shepherding can restrict execution privileges on the basis of code origins. This distinction can ensure that malicious code masquerading as data is never executed, thwarting a large class of security attacks. Second, shepherding can restrict control transfers based on instruction class, source, and target. For example, shepherding can forbid execution of shared library code except through declared entry points, and can ensure that a return instruction only targets the instruction after a call. Finally, shepherding guarantees that sandboxing checks placed around any type of program operation will never be bypassed. We have implemented these capabilities efficiently in a runtime system with minimal or no performance penalties. This system operates on unmodified native binaries, requires no special hardware or operating system support, and runs on existing IA-32 machines under both Linux and Windows.
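The return-address policy mentioned above can be pictured with a small hedged sketch. It is modeled here with a shadow stack, a stricter variant of "a return only targets the instruction after a call": instrumented calls push their return address, and instrumented returns verify the target before the transfer is allowed. The data structures and names are invented for illustration, not taken from the paper's system.

/* Hedged sketch of one shepherding policy: returns must match a
 * legitimate call site. Sizes and names are illustrative.             */
#include <stdio.h>
#include <stdlib.h>

#define SHADOW_DEPTH 4096
static unsigned long shadow[SHADOW_DEPTH]; /* shadow return stack      */
static int top = 0;

static void on_call(unsigned long return_addr)   /* instrumented call  */
{
    if (top < SHADOW_DEPTH)
        shadow[top++] = return_addr;
}

static void on_return(unsigned long target)      /* instrumented ret   */
{
    if (top == 0 || shadow[--top] != target) {
        fprintf(stderr, "policy violation: ret to %#lx\n", target);
        abort();                    /* refuse the control transfer     */
    }
}

int main(void)
{
    on_call(0x401234);     /* call site records its return address     */
    on_return(0x401234);   /* legitimate return: matches shadow entry  */
    on_call(0x401240);
    on_return(0xdeadbeef); /* hijacked return: caught and aborted      */
    return 0;
}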
Transient Fault Detection via Simultaneous Multithreading
- IN PROCEEDINGS OF THE 27TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 2000
"... Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor fa ..."
Abstract - Cited by 267 (7 self)
Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations. We demonstrate ...
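The comparison idea behind such redundant execution can be sketched in a few lines: run the same work twice and compare the outputs before committing them. Below, two POSIX threads stand in for two SMT contexts; all names are illustrative, no fault is actually injected, and a real design compares results in hardware rather than after a join.

/* Illustrative sketch of redundant execution with output comparison.  */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

static uint64_t work(uint64_t seed)       /* deterministic workload    */
{
    uint64_t x = seed;
    for (int i = 0; i < 1000; i++)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

static void *run(void *out)
{
    *(uint64_t *)out = work(42);
    return NULL;
}

int main(void)
{
    uint64_t leading, trailing;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, run, &leading);   /* leading copy        */
    pthread_create(&t2, NULL, run, &trailing);  /* trailing copy       */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    if (leading == trailing)
        printf("outputs match: commit result %llu\n",
               (unsigned long long)leading);
    else
        printf("output mismatch: transient fault detected, re-execute\n");
    return 0;
}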
DAISY: Dynamic Compilation for 100% Architectural Compatibility
1997
"... Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instructi ..."
Abstract - Cited by 206 (13 self)
Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
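The translate-once, reuse-thereafter loop described here can be sketched as a software translation cache keyed by the old architecture's program counter. The direct-mapped cache, its size, and the translate stub below are illustrative assumptions, not DAISY's design details; the "cast out" case from the abstract shows up as a conflict eviction.

/* Sketch of a translation cache: translate a fragment on first
 * execution, reuse the saved translation afterwards.                  */
#include <stdio.h>
#include <stdint.h>

#define CACHE_SLOTS 256

struct fragment { uint32_t guest_pc; int valid; /* + translated code */ };
static struct fragment tcache[CACHE_SLOTS];

static struct fragment *translate(uint32_t guest_pc)
{
    struct fragment *f = &tcache[guest_pc % CACHE_SLOTS];
    printf("translating fragment at PC %#x\n", guest_pc);
    f->guest_pc = guest_pc;  /* a real VMM would emit VLIW code here   */
    f->valid = 1;
    return f;
}

static struct fragment *lookup_or_translate(uint32_t guest_pc)
{
    struct fragment *f = &tcache[guest_pc % CACHE_SLOTS];
    if (f->valid && f->guest_pc == guest_pc)
        return f;               /* hit: reuse the saved translation    */
    return translate(guest_pc); /* miss, or fragment was cast out      */
}

int main(void)
{
    lookup_or_translate(0x1000); /* first execution: translate         */
    lookup_or_translate(0x1000); /* reuse: silent cache hit            */
    lookup_or_translate(0x2000); /* conflicts in this toy cache ...    */
    lookup_or_translate(0x1000); /* ... so 0x1000 is retranslated      */
    return 0;
}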
An Infrastructure for Adaptive Dynamic Optimization
2003
"... Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework ..."
Abstract - Cited by 189 (6 self)
Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight, API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. To demonstrate ...
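To suggest the shape of such a client interface, here is a hypothetical sketch in which the runtime hands each linear code fragment to a client callback for inspection before it runs. None of these names are DynamoRIO's actual API (its real interface uses different, dr_-prefixed entry points); this only models the idea of a client operating on a linear instruction stream.

/* Hypothetical client-interface sketch; all names are invented.       */
#include <stdio.h>

struct instr { const char *mnemonic; struct instr *next; };
struct fragment { struct instr *first; };

/* Client callback: count instructions in every fragment the runtime
 * is about to place in its code cache.                                */
static void client_on_fragment(struct fragment *frag)
{
    int n = 0;
    for (struct instr *i = frag->first; i; i = i->next)
        n++;
    printf("fragment with %d instructions about to execute\n", n);
}

int main(void)
{
    /* The runtime would build this list from decoded native code.     */
    struct instr i3 = { "ret",  NULL };
    struct instr i2 = { "add",  &i3 };
    struct instr i1 = { "load", &i2 };
    struct fragment frag = { &i1 };
    client_on_fragment(&frag);
    return 0;
}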
Clustered Speculative Multithreaded Processors
1999
"... In this paper we present a processor microarchitecture that can simultaneously execute multiple threads and has a clustered design for scalability purposes. A main feature of the proposed microarchitecture is its capability to spawn speculative threads from a single-thread application at run-time. T ..."
Abstract - Cited by 180 (10 self)
In this paper we present a processor microarchitecture that can simultaneously execute multiple threads and has a clustered design for scalability purposes. A main feature of the proposed microarchitecture is its capability to spawn speculative threads from a single-thread application at run-time. These speculative threads use otherwise idle resources of the machine. Spawning a speculative thread involves predicting its control flow as well as its dependences with other threads and the values that flow through them. In this way, threads that are not independent can be executed in parallel. Control-flow, data value and data dependence predictors particularly designed for this type of microarchitecture are presented. Results show the potential of the microarchitecture to exploit speculative parallelism in programs that are hard to parallelize at compile-time, such as the SpecInt95. For a 4-thread unit configuration, some programs such as ijpeg and li can exploit an average degree of parallelism of more than 2 threads per cycle. The average degree of parallelism for the whole SpecInt95 suite is 1.6 threads per cycle. This speculative parallelism results in significant speedups for all the SpecInt95 programs when compared with a single-thread execution.
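The spawn-validate-squash cycle for value-predicted threads can be shown with a tiny sequential model: a speculative thread runs ahead on a predicted live-in value, and its result is committed only if the producing thread's actual value matches. All functions and values below are invented for illustration; real hardware would run the two threads concurrently.

/* Sequential model of value-predicted speculative threading.          */
#include <stdio.h>

static int predict_live_in(void)         { return 100; } /* last-value guess */
static int speculative_body(int live_in) { return live_in * 2; }

int main(void)
{
    int predicted = predict_live_in();
    int spec_result = speculative_body(predicted); /* runs "in parallel" */

    int actual = 100;  /* value produced by the non-speculative thread  */
    if (actual == predicted) {
        printf("prediction correct: commit %d\n", spec_result);
    } else {
        printf("misprediction: squash and re-execute\n");
        printf("recomputed: %d\n", speculative_body(actual));
    }
    return 0;
}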
Trace processors
- IN PROCEEDINGS OF THE 30TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 1997
"... ..."
(Show Context)
A Chip-Multiprocessor Architecture with Speculative Multithreading.
- IEEE Trans. on Computers, 1999
"... ..."
(Show Context)
Selective value prediction
- In 26th Annual International Symposium on Computer Architecture, 1999
"... Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines sel ..."
Abstract - Cited by 138 (13 self)
Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines selective techniques for using value prediction in the presence of predictor capacity constraints and reasonable misprediction penalties. We examine prediction and confidence mechanisms in light of these constraints, and we minimize capacity conflicts through instruction filtering. The latter technique filters which instructions put values into the value prediction table. We examine filtering techniques based on instruction type, as well as giving priority to instructions belonging to the longest data dependence path in the processor’s active instruction window. We apply filtering both to the producers of predicted values and the consumers. In addition, we examine the benefit of using different confidence levels for instructions using predicted values on the longest dependence path.
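A minimal sketch of the machinery this abstract describes: a last-value prediction table with saturating confidence counters, plus an instruction-type filter controlling which instructions may allocate entries in the capacity-limited table. The table size, counter width, and filter policy are illustrative choices, not the paper's configuration.

/* Last-value predictor with confidence and instruction filtering.     */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 1024
#define CONF_MAX   3      /* 2-bit saturating counter                  */
#define CONF_USE   3      /* only use fully confident predictions      */

enum itype { LOAD, ALU, BRANCH };

struct entry { uint64_t last_value; unsigned conf; };
static struct entry table[TABLE_SIZE];

static bool filter_allows(enum itype t)
{
    return t == LOAD;     /* e.g., admit only loads into the table     */
}

static bool predict(uint32_t pc, enum itype t, uint64_t *value)
{
    struct entry *e = &table[pc % TABLE_SIZE];
    if (!filter_allows(t) || e->conf < CONF_USE)
        return false;     /* don't speculate on low confidence         */
    *value = e->last_value;
    return true;
}

static void train(uint32_t pc, enum itype t, uint64_t actual)
{
    if (!filter_allows(t))
        return;           /* filtered instructions never allocate      */
    struct entry *e = &table[pc % TABLE_SIZE];
    if (e->last_value == actual) {
        if (e->conf < CONF_MAX) e->conf++;
    } else {
        e->last_value = actual;
        e->conf = 0;      /* reset confidence when the value changes   */
    }
}

int main(void)
{
    uint64_t v;
    for (int i = 0; i < 5; i++) {    /* a load that always returns 7   */
        bool hit = predict(0x400100, LOAD, &v);
        printf("iter %d: %s\n", i, hit ? "predict" : "no prediction");
        train(0x400100, LOAD, 7);
    }
    return 0;
}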
Dynamic Speculative Precomputation
2001
"... A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, t ..."
Abstract - Cited by 105 (10 self)
A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory-limited benchmarks an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.
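The p-slice construction step can be illustrated by a backward dataflow walk over retired instructions, keeping only those whose outputs feed the delinquent load's address. The toy instruction records below stand in for the paper's back-end analysis hardware; the encoding and register numbering are invented for the example.

/* Backward-slice extraction from a delinquent load (toy model).       */
#include <stdio.h>
#include <stdbool.h>

/* Retired-instruction record: destination and source registers, with
 * -1 meaning "none" (immediates are folded into the text).            */
struct instr { const char *text; int dest; int src1, src2; };

int main(void)
{
    /* Retired history, oldest first. The last entry is the delinquent
     * load whose address computation we want to extract.              */
    struct instr hist[] = {
        { "r1 = load base",  1, -1, -1 },
        { "r9 = r8 + 4",     9,  8, -1 },  /* not part of the slice    */
        { "r2 = r1 + 8",     2,  1, -1 },
        { "r3 = load [r2]",  3,  2, -1 },  /* delinquent load          */
    };
    int n = sizeof hist / sizeof hist[0];
    bool needed[32] = { false };   /* registers the slice still needs  */

    printf("p-slice (newest first):\n  %s\n", hist[n - 1].text);
    if (hist[n - 1].src1 >= 0) needed[hist[n - 1].src1] = true;
    if (hist[n - 1].src2 >= 0) needed[hist[n - 1].src2] = true;

    for (int i = n - 2; i >= 0; i--) {     /* walk backwards in time   */
        if (hist[i].dest >= 0 && needed[hist[i].dest]) {
            printf("  %s\n", hist[i].text);
            needed[hist[i].dest] = false;  /* producer found           */
            if (hist[i].src1 >= 0) needed[hist[i].src1] = true;
            if (hist[i].src2 >= 0) needed[hist[i].src2] = true;
        }
    }
    return 0;
}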