Complexity-Effective Superscalar Processors
In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997
Cited by 385 (5 self)
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
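The queue-based issue scheme in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `steer`, `issue` and `program` are hypothetical, every instruction is assumed to write a unique (already renamed) destination, every operation is assumed to take one cycle, and the fallback to the shortest queue when no queue is empty is a simplification made only for this example.

```python
from collections import deque

def steer(instrs, num_fifos):
    """Place each instruction behind its producer when the producer is at a
    queue tail; otherwise use an empty queue. instrs is a list of
    (dest, srcs) tuples in program order; queues hold instruction indices."""
    fifos = [deque() for _ in range(num_fifos)]
    for idx, (_dest, srcs) in enumerate(instrs):
        target = None
        for q in fifos:
            # The tail instruction of q produces one of our source operands.
            if q and instrs[q[-1]][0] in srcs:
                target = q
                break
        if target is None:
            empty = [q for q in fifos if not q]
            # Assumption of this sketch: fall back to the shortest queue.
            target = empty[0] if empty else min(fifos, key=len)
        target.append(idx)
    return fifos

def issue(fifos, instrs):
    """Issue cycle by cycle, examining only the queue heads.
    Returns a list with the instruction indices issued in each cycle."""
    fifos = [deque(q) for q in fifos]          # work on a copy
    produced_here = {dest for dest, _ in instrs}
    ready = set()                               # values computed so far
    schedule = []
    while any(fifos):
        # A head is ready if each source is live-in or already computed.
        ready_queues = [
            q for q in fifos
            if q and all(s in ready or s not in produced_here
                         for s in instrs[q[0]][1])
        ]
        if not ready_queues:
            raise RuntimeError("no queue head is ready; input is not in program order")
        this_cycle = [q.popleft() for q in ready_queues]
        ready.update(instrs[i][0] for i in this_cycle)
        schedule.append(sorted(this_cycle))
    return schedule

# Two independent dependence chains (a -> b, c -> d) that join in e.
program = [("a", ()), ("b", ("a",)), ("c", ()), ("d", ("c",)), ("e", ("b", "d"))]
queues = steer(program, num_fifos=2)
schedule = issue(queues, program)
```

Only the queue heads are inspected in each cycle, which is the property the abstract credits with simplifying the wakeup and selection logic.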
Dynamo: A Transparent Dynamic Optimization System
ACM SIGPLAN Notices, 2000
Cited by 347 (1 self)
We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT, for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
Quantifying the Complexity of Superscalar Processors
1996
Cited by 72 (0 self)
The delays of pipeline structures in superscalar processors are studied to determine their potential for limiting clock cycle times in future designs. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm.
Transparent dynamic optimization: The design and implementation of Dynamo
1999
Cited by 49 (4 self)
Keywords: dynamic optimization, compiler, trace selection, binary translation
© Copyright Hewlett-Packard Company 1999
Dynamic optimization refers to the runtime optimization of a native program binary. This report describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language,
Efficient, Transparent and Comprehensive Runtime Code Manipulation
2004
Cited by 28 (1 self)
This thesis addresses the challenges of building a software system for general-purpose runtime code manipulation. Modern applications, with dynamically-loaded modules and dynamically-generated code, are assembled at runtime. While it was once feasible at compile time to observe and manipulate every instruction (which is critical for program analysis, instrumentation, trace gathering, optimization, and similar tools), it can now only be done at runtime. Existing runtime tools are successful at inserting instrumentation calls, but no general framework has been developed for fine-grained and comprehensive code observation and modification without high overheads. This thesis demonstrates the feasibility of building such a system in software. We present DynamoRIO, a fully-implemented runtime code manipulation system that supports code transformations on any part of a program, while it executes. DynamoRIO uses code caching technology to provide efficient, transparent, and comprehensive manipulation of an unmodified application running on a stock operating system and commodity hardware. DynamoRIO executes large, complex, modern applications with dynamically-loaded, generated, or even modified code. Despite the
Transparent Dynamic Optimization
1999
Cited by 18 (0 self)
Dynamic optimization refers to the runtime optimization of a native program binary. This paper describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language, compiler, operating system or hardware support. The program binary is not instrumented and is left untouched during Dynamo's operation. Dynamo observes the program's behavior through interpretation to dynamically select hot instruction traces from the running program. The hot traces are optimized using low-overhead optimization techniques and emitted into a software code cache. Subsequent instances of these traces cause the cached version to be executed, resulting in a performance boost. Contrary to intuition, we ...
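The interpret, select, cache and reuse cycle described in this abstract can be sketched as follows. This is only an illustration and not Dynamo's actual design: the class `TraceCache`, its `execute` method, the `interpret` and `optimize` callbacks and the threshold value are all hypothetical, and a "trace" is reduced to a single callable keyed by its start address.

```python
class TraceCache:
    """Toy model of hot-trace selection backed by a software code cache."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold  # hypothetical tuning value
        self.counters = {}                  # trace head -> interpreted executions
        self.cache = {}                     # trace head -> optimized callable

    def execute(self, head, interpret, optimize):
        """Run the code that starts at `head`.

        interpret(head) runs the code slowly and returns its result.
        optimize(head) returns a zero-argument callable, the cached trace.
        Returns (result, mode), where mode is 'cached' or 'interpreted'.
        """
        if head in self.cache:
            # Later instances of a hot trace run the cached version.
            return self.cache[head](), "cached"
        count = self.counters.get(head, 0) + 1
        self.counters[head] = count
        if count >= self.hot_threshold:
            # The trace is now considered hot: optimize it and emit it
            # into the code cache for future executions.
            self.cache[head] = optimize(head)
        return interpret(head), "interpreted"


# Usage: the third and later executions are served from the cache.
tc = TraceCache(hot_threshold=2)
runs = [tc.execute(5, lambda h: h * 2, lambda h: (lambda: h * 2))
        for _ in range(4)]
```

The counter stands in for the profiling that interpretation makes possible; once a head crosses the threshold, the interpretation cost is paid no more for that trace.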
Exploiting Idle Floating-Point Resources For Integer Execution
In Proc. of the Int. Conf. on Programming Lang. Design and Implementation, 1998
Cited by 15 (0 self)
In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal. To exploit these
Scalable Register Renaming via the Quack Register File
2000
Cited by 7 (1 self)
To improve the performance of superscalar microprocessors, developers are continually deepening pipelines, increasing the instruction window size, and widening instruction dispatch. As these microarchitecture features grow they stress the register renaming mechanism by increasing the demands for physical registers, and register read/write bandwidth. There are four major components to register renaming that suffer scaling problems: physical tag assignment (destination allocation), source operand read, result writeback, and result commit. This work introduces a new scalable register renaming scheme, built around the novel Quack register file. The Quack register file has a simple implementation that yields a read latency independent of the physical register count. It also facilitates a mapping strategy that does not require a separate mapping table, or similar structure. The latency and performance of the Quack register file are analyzed and compared to current register renamin...
Transistor Count and Chip-Space Estimation of SimpleScalar-based Microprocessor Models
2001
Cited by 5 (0 self)
This paper proposes a chip space and transistor count estimation tool, which receives its input from the baseline architecture and the configuration file of the microarchitecture performance simulator sim-outorder of the SimpleScalar Tool Set. The estimation tool yields a pre-silicon chip space and transistor count estimate and makes it possible to compare different microprocessor configurations with respect to their potential chip space requirements. The estimation method, which is the basis of our tool, is validated against the configuration parameters of a real processor, yielding transistor count and chip space estimates that are very close to the real processor's numbers.
Portable High Performance Programming via Architecture-Cognizant Divide-and-Conquer Algorithms
2000
Cited by 5 (0 self)
1 Introduction
  1. Divide-and-Conquer and the Memory Hierarchy
  2. Overview of Architecture-Cognizant Divide-and-Conquer
  3. Overview of Napoleon
  4. What You Can Expect
  5. Contributions
     1. Divide-and-Conquer Algorithms for Performance Programming
     2. The Importance of Architecture-Cognizance
     3. Complexity of Determining VariantPolicy
     4. A Framework and System for Divide-and-Conquer Implementations
     5. The Fastest Portable FFT Algorithm
  6. Outline of Thesis ...

