Results 1 -
5 of
5
Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache
, 1993
"... High performance computer implementation today is increasingly directed toward parallelism in the hardware. Superscalar machines, where the hardware can issue more than one instruction each cycle, are being adopted by more implementations. As the trend toward wider issue rates continues, so too must ..."
Abstract
-
Cited by 95 (4 self)
- Add to MetaCart
High performance computer implementation today is increasingly directed toward parallelism in the hardware. Superscalar machines, where the hardware can issue more than one instruction each cycle, are being adopted by more implementations. As the trend toward wider issue rates continues, so too must the ability to fetch more instructions each cycle. Although compilers can improve the situation by increasing the size of basic blocks, hardware mechanisms to fetch multiple possibly non-consecutive basic blocks are also needed. Viable mechanisms for fetching multiple non-consecutive basic blocks have not been previously investigated. We present a mechanism for predicting multiple branches and fetching multiple non-consecutive basic blocks each cycle which is both viable and effective. We measured the effectiveness of the mechanism in terms of the IPC f, the number of instructions fetched per clock for a machine front-end. For one, two, and three basic blocks, the IPC f of integer benchmark...
Adapting Software Pipelining for Reconfigurable Computing
- In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES
, 2000
"... The Garp compiler and architecture have been developed in parallel, in part to help investigate whether features of the architecture help facilitate rapid, automatic compilation utilizing the Garp's rapidly reconfigurable coprocessor. Previously reported work for compiling to Garp has drawn heavily ..."
Abstract
-
Cited by 27 (1 self)
- Add to MetaCart
The Garp compiler and architecture have been developed in parallel, in part to help investigate whether features of the architecture help facilitate rapid, automatic compilation utilizing the Garp's rapidly reconfigurable coprocessor. Previously reported work for compiling to Garp has drawn heavily on techniques from software compilation rather than high-level synthesis. That trend continues in this paper, which describes the extension of those techniques to support pipelined execution of loops on the coprocessor. Even though it targets hardware, our approach resembles VLIW software pipelining much more than it resembles hardware synthesis retiming algorithms.
Dynamic binary translation and optimization
- IEEE Transactions on Computers
, 2001
"... AbstractÐWe describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architectu ..."
Abstract
-
Cited by 21 (2 self)
- Add to MetaCart
AbstractÐWe describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This work reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage. Index TermsÐDynamic compilation, binary translation, dynamic optimization, just-in-time compilation, adaptive code generation, profile-directed feedback, instruction-level parallelism, very long instruction word architectures, virtual machines, instruction set architectures, instruction set layering. æ 1
A Multiprocessor Architecture Combining Fine-Grained . . .
- PARALLEL COMPUTING
, 1994
"... A wide variety of computer architectures have been proposed that attempt to exploit parallelism at different granularities. For example, pipelined processors and multiple instruction issue processors exploit the fine-grained parallelism available at the machine instruction level, while shared memory ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
A wide variety of computer architectures have been proposed that attempt to exploit parallelism at different granularities. For example, pipelined processors and multiple instruction issue processors exploit the fine-grained parallelism available at the machine instruction level, while shared memory multiprocessors exploit the coarse-grained parallelism available at the loop level. Using a registertransfer level simulation methodology, this paper examines the performance of a multiprocessor architecture that combines both coarse-grained and fine-grained parallelism strategies to minimize the execution time of a single application program. These simulations indicate that the best system performance is obtained by using a mix of fine-grained and coarse-grained parallelism in which any number of processors can be used, but each processor should be pipelined to a degree of 2 to 4, or each should be capable of issuing from 2 to 4 instructions per cycle. These results suggest that current high-performance microprocessors, which typically can have 2 to 4 instructions simultaneously executing, may provide excellent components with which to construct a multiprocessor system.
Selective Guarded Execution Using Profiling on a Dynamically Scheduled Processor
"... Modern dynamically scheduled processors use branch prediction hardware to speculatively fetch and execute most likely executed paths in a program. Complex branch predictors have been proposed which attempt to identify these paths accurately such that the hardware can benefit from out-of-order (OOO) ..."
Abstract
- Add to MetaCart
Modern dynamically scheduled processors use branch prediction hardware to speculatively fetch and execute most likely executed paths in a program. Complex branch predictors have been proposed which attempt to identify these paths accurately such that the hardware can benefit from out-of-order (OOO) execution. Recent studies have shown that inspite of such complex prediction schemes, there still exist many frequently executed branches which are difficult to predict. Predicated execution has been proposed as an alternative technique to eliminate some of these branches in various forms ranging from a restrictive support to a full-blown support. We call the restrictive form of predicated execution as guarded execution. In this paper, we propose a new algorithm which uses profiling and selectively performs if-conversion for architectures with guarded execution support. Branch profiling is used to gather the taken, nontaken and misprediction counts for every branch. This combined with block...

