Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency
In Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995
Cited by 67 (5 self)
Untolerated load instruction latencies often have a significant impact on overall program performance. As one means of mitigating this effect, we present an aggressive hardware-based mechanism that provides effective support for reducing the latency of load instructions. Through the judicious use of instruction predecode, base register caching, and fast address calculation, it becomes possible to complete load instructions up to two cycles earlier than traditional pipeline designs. For a pipeline with one cycle data cache access, this results in what we term a zero-cycle load. A zero-cycle load produces a result prior to reaching the execute stage of the pipeline, allowing subsequent dependent instructions to issue unfettered by load dependencies. Programs executing on processors with support for zero-cycle loads experience significantly fewer pipeline stalls due to load instructions and increased overall performance. We present two pipeline designs supporting zero-cycle loads: one for...
Vector Microprocessors
In Hot Chips VII, 1998
Cited by 62 (4 self)
Vector Microprocessors, by Krste Asanovic. Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor John Wawrzynek, Chair.
Most previous research into vector architectures has concentrated on supercomputing applications and small enhancements to existing vector supercomputer implementations. This thesis expands the body of vector research by examining designs appropriate for single-chip full-custom vector microprocessor implementations targeting a much broader range of applications. I present the design, implementation, and evaluation of T0 (Torrent-0): the first single-chip vector microprocessor. T0 is a compact but highly parallel processor that can sustain over 24 operations per cycle while issuing only a single 32-bit instruction per cycle. T0 demonstrates that vector architectures are well suited to full-custom VLSI implementation and that they perform well on many multimedia and human-machine interface tasks. The remainder of the thesis contains ...
Message dispatch on pipelined processors
In ECOOP'95 Conference Proceedings, 1995
Cited by 22 (1 self)
Object-oriented systems must implement message dispatch efficiently in order not to penalize the object-oriented programming style. We characterize the performance of most previously published dispatch techniques for both statically- and dynamically-typed languages with both single and multiple inheritance. Hardware organization (in particular, branch latency and superscalar instruction issue) significantly impacts dispatch performance. For example, inline caching may outperform C++-style “vtables” on deeply pipelined processors even though it executes more instructions per dispatch. We also show that adding support for dynamic typing or multiple inheritance does not significantly impact dispatch speed for most techniques, especially on superscalar machines. Instruction space overhead (calling sequences) can exceed the space cost of data structures (dispatch tables), so that minimal table size may not imply minimal run-time space usage.
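The contrast this abstract draws between inline caching and table-based dispatch can be sketched as follows. This is an illustrative Python model under stated assumptions, not the paper's code: `lookup`, `InlineCacheSite`, `build_vtable`, and the shape classes are hypothetical names, and a Python model cannot reproduce the branch-latency effects the paper measures.

```python
def lookup(cls, selector):
    """Full method lookup: walk the class hierarchy (the slow path)."""
    for klass in cls.__mro__:
        if selector in vars(klass):
            return vars(klass)[selector]
    raise AttributeError(selector)


class InlineCacheSite:
    """One call site that remembers the last receiver class and its target."""

    def __init__(self, selector):
        self.selector = selector
        self.cached_class = None
        self.cached_target = None
        self.hits = 0
        self.misses = 0

    def send(self, receiver, *args):
        cls = type(receiver)
        if cls is self.cached_class:
            # Hit: a type check followed by a call to the remembered target.
            self.hits += 1
            return self.cached_target(receiver, *args)
        # Miss: do the full lookup, then rebind the cache to this class.
        self.misses += 1
        target = lookup(cls, self.selector)
        self.cached_class, self.cached_target = cls, target
        return target(receiver, *args)


def build_vtable(cls, selectors):
    """Table-based dispatch: one slot per selector, filled once per class."""
    return [lookup(cls, s) for s in selectors]


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r  # integer stand-in for pi, for the example


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


# Inline cache: the second Circle hits; the first Circle and the Square miss.
site = InlineCacheSite("area")
results = [site.send(s) for s in (Circle(1), Circle(2), Square(2))]

# Vtable: every call is an indexed load followed by an indirect call.
AREA_SLOT = 0
vtables = {cls: build_vtable(cls, ["area"]) for cls in (Circle, Square)}
shape = Square(3)
vtable_result = vtables[type(shape)][AREA_SLOT](shape)
```

In this model the inline cache does more work per call (a class comparison plus a call), while the vtable path always goes through an indirect call, which is the distinction the abstract ties to pipeline depth.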
Software and Hardware Techniques for Efficient Polymorphic Calls
1999
Cited by 11 (1 self)
Object-oriented code looks different from procedural code. The main difference is the increased frequency of polymorphic calls. A polymorphic call looks like a procedural call, but where a procedural call has only one possible target subroutine, a polymorphic call can result in the execution of one of several different subroutines. The choice is made at run time, and depends on the type of the receiving object (the first argument). Polymorphic calls enable the construction of clean, modular code design. They allow the programmer to invoke operations on an object without knowing its exact type in advance. This flexibility incurs an overhead: in general, polymorphic calls must be resolved at run time. The overhead of this run time polymorphic call resolution can lead a programmer to sacrifice clarity of design for more efficient code, by replacing instances of polymorphic calls by several single-target procedural calls, removing run time polymorphism. This practice typically leads to a m...
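The manual rewrite this abstract describes, replacing one polymorphic call with several single-target procedural calls, might look like the sketch below. It is a hypothetical Python illustration, not taken from the thesis; `Disc`, `Box`, `disc_area`, `box_area`, and both `total_area_*` functions are invented names.

```python
class Disc:
    def __init__(self, r):
        self.r = r

    def area(self):
        return disc_area(self)


class Box:
    def __init__(self, side):
        self.side = side

    def area(self):
        return box_area(self)


def disc_area(d):
    return 3 * d.r * d.r  # integer stand-in for pi, for the example


def box_area(b):
    return b.side * b.side


def total_area_polymorphic(shapes):
    """Clean version: one polymorphic call, target chosen at run time."""
    return sum(s.area() for s in shapes)


def total_area_type_tested(shapes):
    """Hand-optimized version: explicit type tests and single-target calls."""
    total = 0
    for s in shapes:
        if type(s) is Disc:
            total += disc_area(s)
        elif type(s) is Box:
            total += box_area(s)
        else:
            total += s.area()  # fall back to run-time resolution
    return total


shapes = [Disc(1), Box(2), Disc(2)]
```

Both functions return the same total, but the second must be edited every time a new shape class is added, which is the loss of design clarity the abstract refers to.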
Performance Issues in Correlated Branch Prediction Schemes
1995
Cited by 8 (3 self)
Accurate static branch prediction is the key to many techniques for exposing, enhancing, and exploiting Instruction Level Parallelism (ILP). The initial work on static correlated branch prediction (SCBP) demonstrated improvements in branch prediction accuracy, but did not address overall performance. In particular, SCBP expands the size of executable programs, which negatively affects the performance of the instruction memory hierarchy. Using the profile information available under SCBP, we can minimize these negative performance effects through the application of code layout and branch alignment techniques. We evaluate the performance effect of SCBP and these profile-driven optimizations on instruction cache misses, branch mispredictions, and branch misfetches for a number of recent processor implementations. We find that SCBP improves performance over (traditional) per-branch static profile prediction. We also find that SCBP improves the performance benefits gained from branch alignme...
Message Dispatch on Modern Computer Architectures
Conference Proceedings, 1994
Cited by 8 (2 self)
Object-oriented systems must implement message dispatch efficiently in order not to penalize the object-oriented programming style. We characterize the performance of most previously published dispatch techniques for both statically- and dynamically-typed languages with both single and multiple inheritance. Hardware organization (in particular, branch latency and superscalar instruction issue) significantly impacts dispatch performance. For example, inline caching may outperform C++-style "vtables" on deeply pipelined processors even though it executes more instructions per dispatch. We also show that adding support for dynamic typing or multiple inheritance does not significantly impact dispatch speed for most techniques, especially on superscalar machines. Also, instruction space overhead (calling sequences) can exceed the space cost of data structures (dispatch tables), so that minimal table size may not imply minimal runtime space usage.
Compiler and Microarchitecture Mechanisms for Exploiting Registers to Improve Memory Performance
2001
Cited by 3 (0 self)
... name for a data object.
Def: Set of data that is defined by a statement.
Use: Set of data that is used by a reference.
DefUseChain: Marker that indicates whether reaching-definition analysis was run.
DefUseSummary: Mod-ref information for a function. It contains the non-local data items which are potentially modified or referenced by the function.
ReachingDef: For each statement, collection of definitions that reach the statement.
Replacement: For each node on the parse tree, provides a back pointer to the parent node and implements node self-replacement.
AvailableExpression: For each statement, the collection of expressions that reach the statement.
ValueProfile: For each function, all the parameters and the values they take.
Labelflow: Store goto and label information.
LiveOut, LiveIn, LiveVariable: Used during inlining to estimate when inlining should not be performed because of high register pressure.
Table 7.3: MIRV attributes (columns: Type Name, Description).
7.5.14. Po...
Strategies For The Modelling And Simulation Of Asynchronous Computer Architectures
1995
Cited by 2 (1 self)
Preface
Acknowledgements
1 Introduction
1.1 Background
1.2 Motivation and Objectives
1.3 Structure of the Thesis
1.3.1 Related Publications
2 The Quest for High Performance
2.1 Introduction
2.2 Bit and Instruction Level Parallelism
2.3 Reduced Instruction Set Computers
2.4 The Limits of Sequential Computation
2.5 Parallel Computer Architectures
2.5.1 SIMD
2.5.2 MIMD
2.5.2.1 Shared Memory MIMD Architectures
2.5.2.2 Distributed M...

