Efficient Superscalar Performance through Boosting
, 1992
Cited by 78 (5 self)
The foremost goal of superscalar processor design is to increase performance through the exploitation of instruction-level parallelism (ILP). Previous studies have shown that speculative execution is required for high instruction per cycle (IPC) rates in non-numerical applications. The general trend has been toward supporting speculative execution in complicated, dynamically-scheduled processors. Performance, though, is more than just a high IPC rate; it also depends upon instruction count and cycle time. Boosting is an architectural technique that supports general speculative execution in simpler, statically-scheduled processors. Boosting labels speculative instructions with their control dependence information. This labelling eliminates control dependence constraints on instruction scheduling while still providing full dependence information to the hardware. We have incorporated boosting into a trace-based, global scheduling algorithm that exploits ILP without adversely affecting the instruction count of a program. We use this algorithm and estimates of the boosting hardware involved to evaluate how much speculative execution support is really necessary to achieve good performance. We find that a statically-scheduled superscalar processor using a minimal implementation of boosting can easily reach the performance of a much more complex dynamically-scheduled superscalar processor.
Memory Consistency Models for Shared-Memory Multiprocessors
- WRL RESEARCH REPORT
, 1995
Cited by 61 (1 self)
The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. However, fewer ordering guarantees can compromise programmability and portability. Many of the previously proposed models either fail to provide reasonable programming semantics or are biased toward programming ease at the cost of sacrificing performance. Furthermore, the lack of consensus on an acceptable model hinders software portability across different systems. This dissertation focuses on providing a balanced solution that directly addresses the trade-off between programming ease and performance. To address programmability, we propose an alternative method for specifying memory behavior that presents a higher level abstraction to the programmer. We show that with only a few types of information supplied by the
A Trace Cache Microarchitecture and Evaluation
- IEEE Transactions on Computers
, 1999
Cited by 37 (3 self)
As the instruction issue width of superscalar processors increases, instruction fetch bandwidth requirements will also increase. It will eventually become necessary to fetch multiple basic blocks per clock cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. Trace caches overcome this limitation by caching traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. In this paper we present and evaluate a microarchitecture incorporating a trace cache. The microarchitecture provides high instruction fetch bandwidth with low latency by explicitly sequencing through the program at the higher level of traces, both in terms of (1) control flow prediction and (2) instruction supply. For the SPEC95 integer benchmarks, trace-level sequencing improves performance from 15% to 35% over an otherwise equally-sophisticated, but contiguous multiple-block fetch mechanism. Most of this performance improvement is due to the trace cache. However, for one benchmark whose performance is limited by branch mispredictions, the performance gain is due almost entirely to improved prediction accuracy.
Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures
- In Proceedings of the 29th International Symposium on Microarchitecture
, 1996
Cited by 36 (1 self)
To exploit larger amounts of instruction level parallelism, processors are being built with wider issue widths and larger numbers of functional units. Instruction fetch rate must also be increased in order to effectively exploit the performance potential of such processors. Block-structured ISAs provide an effective means of increasing the instruction fetch rate. We define an optimization, called block enlargement, that can be applied to a block-structured ISA to increase the instruction fetch rate of a processor that implements that ISA. We have constructed a compiler that generates block-structured ISA code, and a simulator that models the execution of that code on a block-structured ISA processor. We show that for the SPECint95 benchmarks, the block-structured ISA processor executing enlarged atomic blocks outperforms a conventional ISA processor by 12% while using simpler microarchitectural mechanisms to support wide-issue and dynamic scheduling.
Facilitating superscalar processing via a combined static/dynamic register renaming scheme
- in Proceedings of the 27th Annual ACM/IEEE International Symposium on Microarchitecture
, 1994
Cited by 27 (8 self)
A superscalar implementation of a conventional instruction set architecture (ISA) requires N(N-1) comparators to determine dependencies between the N instructions issuing concurrently [2] and 2N register file read ports to handle the 2 operands that each instruction can potentially source. On the other hand, if the compiler is allowed to specify part of the renaming tag, we show that we can eliminate the comparators needed to detect data dependencies between instructions issuing concurrently, and we can reduce the number of read ports from 16 to about 7 without losing performance. Finally, we show that this approach more efficiently implements predicated execution than can be done with a conventional ISA on a machine that renames registers.
Using Predicated Execution to Improve the Performance of a Dynamically Scheduled Machine with Speculative Execution
- In PACT
, 1995
Cited by 24 (6 self)
Conditional branches incur a severe performance penalty in wide-issue, deeply pipelined processors. Speculative execution [14, 12] and predicated execution [7, 16, 5, 10, 15, 22, 9] are two mechanisms that have been proposed for reducing this penalty. Speculative execution can completely eliminate the penalty associated with a particular branch, but requires accurate branch prediction to be effective. Predicated execution does not require accurate branch prediction to eliminate the branch penalty, but is not applicable to all branches and can increase the latencies within the program. This paper examines the performance benefit of using both mechanisms to reduce the branch execution penalty. Predicated execution is used to handle the hard-to-predict branches and speculative execution is used to handle the remaining branches. The hard-to-predict branches within the program are determined by profiling. We show that this approach can significantly reduce the branch execution penalty suffe...
Trace processors: Exploiting hierarchy and speculation
, 1999
Cited by 9 (1 self)
In high-performance processors, increasing the number of instructions fetched and executed in parallel is becoming increasingly complex, and the peak bandwidth is often underutilized due to control and data dependences. A trace processor 1) efficiently sequences through programs in large units, called traces, and allocates trace-sized units of work to distributed processing elements (PEs), and 2) uses aggressive speculation to partially alleviate the effects of control and data dependences. A trace is a dynamic sequence of instructions, typically 16 to 32 instructions in length, which embeds any number of taken or not-taken branch instructions. The hierarchical, trace-based approach to increasing parallelism overcomes basic inefficiencies of managing fetch and execution resources on an individual instruction basis. This thesis shows the trace processor is a good microarchitecture for implementing wide-issue machines. Three key points support this conclusion. 1. Trace processors perform better than wide-issue superscalar counterparts because they deliver high instruction throughput without significantly increasing cycle time. The
Program Balance and its Impact on High Performance RISC Architectures
- in Proceedings of the International Symposium on High Performance Computer Architecture
, 1995
Cited by 7 (4 self)
Information on the behavior of programs is essential for deciding the number and nature of functional units in high performance architectures. In this paper, we present studies on the balance of access and computation tasks on a typical RISC architecture, the MIPS. The MIPS programs are analyzed to find the demands they place on the memory system and the floating point or integer computation units. A balance metric that indicates the match of accessing power to computation power is calculated. It is observed that many of the SPEC floating point programs and kernels from supercomputing applications, typically considered as computation intensive programs, place extensive demands on the memory system in terms of memory bandwidth. Access related instructions are seen to dominate most instruction streams. We discuss how these instruction stream characteristics can limit the instruction issue in superscalar processors. The properties of the dynamic instruction mix are used to alert computer a...
Classification-Directed Branch Predictor Design
, 1997
Cited by 2 (0 self)
Classification-Directed Branch Predictor Design, by Po-Yung Chang. Chair: Yale N. Patt. Pipeline stalls due to branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined superscalar processors. Two well-known mechanisms have been proposed to reduce the branch penalty: speculative execution in conjunction with branch prediction, and predicated execution. This dissertation proposes branch classification, coupled with improvements in conditional branch prediction, indirect branch prediction, and predicated execution, to reduce the branch execution penalty. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor is constructed which achieves a higher prediction accuracy than any branch predictor previously reported in the literature. This dissertation also proposes a new prediction mechanism for predictin...
βτοο: Object Oriented Language Compilation for Fine Grained Targets
, 1992
βτοο (Bee-too) is an object oriented programming language integrating class abstraction with the block structured function semantics of conventional imperative languages. High concurrency and architecture independence are promoted by the programming model, the βτοο compiler, and the fine-grained method-state graph generated intermediately for subsequent target-generation stages. After briefly introducing the aims of the project and programming model, this paper concentrates on the qualities of the programming model and role of the compiler in generating its intermediate-representation method-state graph. The graph is generated from expressions with imperative operations on closure-abstracted object interfaces, and describes actions and storage which approach the granularity of those found at the lowest hardware levels and hence are extremely fine-grained. Graph structure re-writing during target-generation caters for different architectures. A formal specification and concrete syntax o...

